Parallel processing with SymSpell lookup_compound
I am new to NLP-related tasks and I'm doing this with Pandas (Python). The idea is that each row has a text on which I'm trying to run a spell corrector (sentence length may vary), and the pandas DataFrame currently holds slightly over ~1 million records, likely to increase in the future.
Initially, I tried to use symspell's lookup_compound directly via Pandas' apply function, but it took a very long time (>12 hours) and there were still no results.
# sym_spell is a SymSpell instance with a frequency dictionary already loaded
def symspell_compound(input_term, max_edit_distance=2):
    suggestions = sym_spell.lookup_compound(input_term, max_edit_distance)
    # lookup_compound returns a list of suggestions; return the first (best) one
    for suggestion in suggestions:
        return suggestion.term

df['text_data'].apply(symspell_compound)
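For reference, sym_spell above is created outside the snippet; a typical symspellpy initialisation looks roughly like the sketch below. The use of the bundled English unigram and bigram dictionaries is an assumption on my part; the actual setup is not shown in the question.

import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# unigram frequency dictionary (term, count) shipped with symspellpy
sym_spell.load_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_dictionary_en_82_765.txt"),
    term_index=0, count_index=1)
# bigram dictionary improves lookup_compound's word segmentation
sym_spell.load_bigram_dictionary(
    pkg_resources.resource_filename("symspellpy", "frequency_bigramdictionary_en_243_342.txt"),
    term_index=0, count_index=2)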
Then I came across joblib's Parallel function. I wasn't able to find many examples of it, but it seems to work on lists. So after extracting text_data into a list, I applied Parallel() together with the symspell_compound function, but processing was still slow (see the verbose printout below).
from joblib import Parallel, delayed

text_list = df['text_data'].to_list()
test_parallel = Parallel(n_jobs=4, verbose=10)(delayed(symspell_compound)(i) for i in text_list[:1000])
This is the verbose printout when I tried it on a sample of 1,000 records.
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done 5 tasks | elapsed: 52.2s
[Parallel(n_jobs=4)]: Done 10 tasks | elapsed: 1.7min
[Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 2.8min
[Parallel(n_jobs=4)]: Done 24 tasks | elapsed: 3.9min
[Parallel(n_jobs=4)]: Done 33 tasks | elapsed: 5.2min
[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.6min
[Parallel(n_jobs=4)]: Done 53 tasks | elapsed: 8.2min
[Parallel(n_jobs=4)]: Done 64 tasks | elapsed: 9.9min
[Parallel(n_jobs=4)]: Done 77 tasks | elapsed: 11.9min
Any ideas on what has gone wrong (e.g. with the function parameters), or how I can do this more efficiently? Thanks in advance.
Side note: I'm doing this in a CDSW workbench with 4 CPUs and 8 GB of memory (as this is the maximum allowed so far).
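One way to reduce joblib's per-task overhead is to hand each worker a chunk of sentences rather than a single sentence, so the dispatch and pickling cost is amortised over many calls. The sketch below assumes symspell_compound and text_list as defined above; process_chunk and chunk_size are illustrative names, not part of the original code.

from joblib import Parallel, delayed

def process_chunk(sentences):
    # correct a whole batch inside one task instead of one sentence per task
    return [symspell_compound(s) for s in sentences]

chunk_size = 250
chunks = [text_list[i:i + chunk_size] for i in range(0, len(text_list), chunk_size)]
results_nested = Parallel(n_jobs=4, verbose=10)(delayed(process_chunk)(c) for c in chunks)
corrected = [term for chunk in results_nested for term in chunk]  # flatten back into one list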
Comments (1)
Performance-wise, Python is probably not the best choice. Using the C# implementation, LookupCompound can reach 5,000 words/s single-core on a 2012 MacBook (see https://seekstorm.com/blog/sub-millisecond-compound-aware-automatic-spelling-correction/). One of the following ports with Python bindings might improve performance by orders of magnitude (a rough way to measure your current Python throughput for comparison is sketched after the list):
Rust port https://github.com/reneklacan/symspell
Python bindings for Rust port https://github.com/zoho-labs/symspell
Python bindings for C++ port
Original C# version https://github.com/wolfgarbe/symspell
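For comparison with the 5,000 words/s figure above, the throughput of the current Python setup can be measured with a quick timing loop. This is only a sketch and assumes the sym_spell instance and text_list from the question are already in scope.

import time

sample = text_list[:100]
start = time.perf_counter()
total_words = 0
for sentence in sample:
    sym_spell.lookup_compound(sentence, max_edit_distance=2)
    total_words += len(sentence.split())
elapsed = time.perf_counter() - start
print(f"{total_words / elapsed:.1f} words/s over {len(sample)} sentences")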