Using SymSpell compound with parallel processing

Posted 2025-01-09 00:19:02

I am new to NLP-related tasks. I'm working in Python with pandas: each row of the DataFrame holds a text (sentence length varies) that I want to run a spell corrector on. The DataFrame currently has slightly over 1 million records and is likely to grow.

Initially I tried calling SymSpell's lookup_compound directly through pandas' apply function, but it ran for a very long time (more than 12 hours) and still produced no results.

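# sym_spell is assumed to be an already-initialized SymSpell instance with its dictionary loaded
def symspell_compound(input_term, max_edit_distance=2):
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # Return the first suggestion's term, or None if there are no suggestions
    return suggestions[0].term if suggestions else None

df['text_data'].apply(symspell_compound)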

Then I came across joblib's Parallel. I couldn't find many examples of it, but it seems to work on lists. So I extracted text_data into a list and ran Parallel() with the symspell_compound function, yet processing was still slow (see the verbose output below).

from joblib import Parallel, delayed

text_list = df['text_data'].to_list()
test_parallel = Parallel(n_jobs=4, verbose=10)(delayed(symspell_compound)(i) for i in text_list[:1000])

This is the verbose output when I tried it on a sample of 1000 records.

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:   52.2s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  1.7min
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:  2.8min
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:  3.9min
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  5.2min
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  6.6min
[Parallel(n_jobs=4)]: Done  53 tasks      | elapsed:  8.2min
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed:  9.9min
[Parallel(n_jobs=4)]: Done  77 tasks      | elapsed: 11.9min
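
As a rough, back-of-the-envelope reading of this log (a sketch only, assuming the rate in the last log line stays constant over the whole DataFrame and rounding the record count to 1 million), the throughput can be extrapolated like this:

```python
# Figures copied from the last line of the verbose log above.
tasks_done = 77
elapsed_min = 11.9

# "Slightly over 1 million" records, rounded down for the estimate.
total_records = 1_000_000

# Wall-clock seconds per record with all 4 workers running.
sec_per_record = elapsed_min * 60 / tasks_done

# Extrapolation, assuming the rate stays constant for every record.
est_days = sec_per_record * total_records / 86_400

print(f"{sec_per_record:.1f} s per record, about {est_days:.0f} days in total")
```

At roughly 9 seconds of wall-clock time per record even with 4 workers, the full dataset would take on the order of months, so the sample run suggests the cost of each individual call is the issue rather than the number of workers.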

Any ideas on what has gone wrong (e.g. a function parameter), or how I can do this more efficiently? Thanks in advance.

Side note: I'm running this in a CDSW workbench with 4 CPUs and 8 GB of memory (the maximum allowed so far).

如日中天 2025-01-16 00:19:02

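Python is probably not the best choice when it comes to performance. The C# implementation of LookupCompound reaches 5,000 words per second on a single core of a 2012 Macbook (see https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic-spelling-correction/). One of the following ports with Python bindings might help improve performance by orders of magnitude:

Rust port: https://github.com/reneklacan/symspell

Python bindings for the Rust port: https://github.com/zoho-labs/symspell

Python bindings for the C++ port

Original C# version: https://github.com/wolfgarbe/symspell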
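
To put the 5,000 words per second figure in perspective, here is a small estimate of how long the question's dataset would take at that rate. It is only a sketch: the question does not say how long the texts are, so `avg_words_per_record` is a hypothetical value, and the quoted rate was measured for the C# version on specific hardware, so the ports listed above may perform differently.

```python
def estimated_hours(total_records, avg_words_per_record, words_per_sec):
    """Estimate single-core processing time in hours at a fixed word rate."""
    total_words = total_records * avg_words_per_record
    return total_words / words_per_sec / 3600

# 1 million records (from the question, rounded), the rate quoted in the
# answer, and a hypothetical average of 20 words per record.
hours = estimated_hours(
    total_records=1_000_000,
    avg_words_per_record=20,  # assumption: not stated in the question
    words_per_sec=5000,
)
print(f"about {hours:.1f} hours on one core")
```

With these assumed inputs the estimate is a little over an hour on one core; changing `avg_words_per_record` scales the result linearly.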
