Parallel processing with SymSpell compound lookup

Posted 2025-01-09 00:19:02

I'm new to NLP and am working in Python with pandas. Each row of my DataFrame holds a text of varying length that I want to run through a spell corrector. The DataFrame currently has slightly over 1 million records and is likely to grow.

Initially I called symspell's lookup_compound directly through pandas' apply, but it ran for more than 12 hours without producing a result.

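def symspell_compound(input_term, max_edit_distance=2):
    # sym_spell is a SymSpell instance whose dictionary is already loaded
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # Return the top suggestion, or the input unchanged if there is none
    return suggestions[0].term if suggestions else input_term

df['text_data'].apply(symspell_compound)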

Then I came across joblib's Parallel. I couldn't find many examples of it, but it seems to work on lists. So I extracted text_data into a list and ran Parallel() with the symspell_compound function, yet processing was still slow (see the verbose output below).

from joblib import Parallel, delayed

text_list = df['text_data'].to_list()
# One task per text, tried on the first 1000 records only
test_parallel = Parallel(n_jobs=4, verbose=10)(delayed(symspell_compound)(i) for i in text_list[:1000])
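One thing that might be worth trying, independent of the library used, is to correct each distinct text only once and to hand the workers batches rather than single rows. The sketch below is not a measured fix for this case: `chunked`, `correct_batch` and `correct_all` are hypothetical helper names, and `fake_correct` is only a stand-in for symspell_compound so that the example runs without a dictionary.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def correct_batch(texts, correct):
    """Apply the corrector to every text in one batch."""
    return [correct(t) for t in texts]


def correct_all(texts, correct, batch_size=1000):
    """Correct each distinct text once, then map results back to every row."""
    unique = list(dict.fromkeys(texts))  # drop duplicates, keep first-seen order
    corrected = {}
    for batch in chunked(unique, batch_size):
        corrected.update(zip(batch, correct_batch(batch, correct)))
    return [corrected[t] for t in texts]


def fake_correct(text):
    # Stand-in for symspell_compound (assumption, for illustration only)
    return text.strip().lower()


print(correct_all(["Helo World ", "Helo World ", "anothr text"], fake_correct, batch_size=2))
```

With joblib, the loop over batches could be replaced by `Parallel(n_jobs=4)(delayed(correct_batch)(batch, symspell_compound) for batch in chunked(unique, batch_size))`, so that each task carries many rows instead of one. Whether that helps depends on where the time is actually going.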

This is the verbose output when I tried it on a sample of 1000 records.

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:   52.2s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:  2.8min
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:  3.9min
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  5.2min
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  6.6min
[Parallel(n_jobs=4)]: Done  53 tasks      | elapsed:  8.2min
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:  9.9min
[Parallel(n_jobs=4)]: Done  77 tasks      | elapsed: 11.9min
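For scale, the last log line can be extrapolated to the full dataset. The short calculation below uses only the figures from the log and the record count given in the question; it assumes the rate stays constant, which it may not.

```python
# Figures taken from the last line of the joblib log above
tasks_done = 77
elapsed_seconds = 11.9 * 60

# Wall-clock time per record with 4 workers running
seconds_per_record = elapsed_seconds / tasks_done

# Projection for the full DataFrame (about 1 million records)
total_records = 1_000_000
projected_days = total_records * seconds_per_record / 86_400

print(f"{seconds_per_record:.1f} s per record (wall clock, 4 workers)")
print(f"{projected_days:.0f} days for {total_records:,} records")
```

That works out to roughly 9 seconds per record and more than 100 days in total, so adding workers alone is unlikely to be enough.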

Any ideas on what has gone wrong (e.g. with the function parameters), or how I can do this more efficiently? Thanks in advance.

Side note: I'm doing this in a CDSW workbench with 4 CPUs and 8 GB of memory (the maximum I'm allowed so far).

Comments (1)

如日中天 2025-01-16 00:19:02

Performance-wise, Python probably isn't the best choice. The C# implementation of LookupCompound can reach 5000 words/s on a single core of a 2012 MacBook (see https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic.spelling-correction/ ). One of the following ports, or their Python bindings, might help improve performance by orders of magnitude:

Rust port https://github.com/reneklacan/symspell

Python bindings for Rust port https://github.com/zoho-labs/symspell

Python bindings for C++ port

Original C# version https://github.com/wolfgarbe/symspell
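To put the quoted throughput in perspective, here is a rough estimate of the run time at that rate. The 5000 words/s figure is the single-core number cited above; the average of 20 words per record is purely an assumption, since the question does not give text lengths, and `estimated_minutes` is a hypothetical helper.

```python
def estimated_minutes(records, avg_words_per_record, words_per_second):
    """Rough run time in minutes at a constant throughput."""
    return records * avg_words_per_record / words_per_second / 60


# 1 million records from the question; 20 words per record is an assumed value
minutes = estimated_minutes(1_000_000, 20, 5000)
print(f"about {minutes:.0f} minutes on a single core")
```

Actual numbers will depend on text length, dictionary size and the edit distance used.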
