How to speed up language-tool library use case

Posted 2025-02-04 08:36:13


I have a pandas dataframe with 3 million rows of social media comments. I'm using the language-tool-python library to find the number of grammatical errors in a comment. AFAIK the language-tool library by default sets up a local language-tool server on your machine and queries responses from that.

Getting the number of grammatical errors just consists of creating an instance of the language tool object and calling the .check() method with the string you want to check as a parameter.

>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2

So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). Now I am pretty sure this works. It's quite straightforward. This single line of code has been running for the past hour.

Because running the above example took 10-20 seconds, with 3 million instances it might as well take virtually forever.

Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside of a ThreadPoolExecutor help? Intuitively it makes sense to me, as it's an I/O-bound task.

I am open to any suggestions on how to speed up this process, and if the above method works, I would appreciate it if someone could show me some sample code.

Edit - Correction.

The 10-20 seconds includes the instantiation; calling the method itself is almost instantaneous.


Comments (4)

夏见 2025-02-11 08:36:13


I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().

LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:

Method 1: Initialize multiple servers

import language_tool_python

# Each LanguageTool instance starts its own local server process.
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))

Then call each server from a different thread. Or, alternatively, initialize each server within its own thread. A sketch of this pattern follows.
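A minimal sketch of that idea, assuming the question's DataFrame df with a 'body' column; the server count of 8 and the round-robin routing are illustrative choices, not something the library prescribes:

import concurrent.futures

import language_tool_python

NUM_SERVERS = 8  # illustrative; each instance spawns its own local server

servers = [language_tool_python.LanguageTool('en-US') for _ in range(NUM_SERVERS)]

def count_errors(indexed_text):
    i, text = indexed_text
    tool = servers[i % NUM_SERVERS]  # spread the rows across the servers
    return len(tool.check(text))

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_SERVERS) as executor:
    df['body_num_errors'] = list(executor.map(count_errors, enumerate(df['body'])))

A thread pool (rather than a process pool) is enough here because each check is just an HTTP request to a local server, so the threads spend most of their time waiting on I/O.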

Method 2: Increase the thread count

LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that? From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
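A minimal sketch of setting that option through the config dict (the same mechanism the next answer uses for maxSpellingSuggestions); whether maxCheckThreads is an accepted key in your version of language_tool_python, and whether 20 is a sensible value, are assumptions to verify:

import language_tool_python

# Assumption: maxCheckThreads is passed through to the local server's
# HTTPServerConfig; 20 is an illustrative value (the default is reportedly 10).
tool = language_tool_python.LanguageTool(
    'en-US',
    config={'maxCheckThreads': 20},
)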

烟酒忠诚 2025-02-11 08:36:13


In the documentation, we can see that language-tool-python has the configuration option maxSpellingSuggestions.

However, despite the name of the variable and the default value being 0, I have noticed that the code runs noticeably faster (almost 2 times faster) when this parameter is actually set to 1.

I don't know where this discrepancy comes from, and the documentation does not mention anything specific about the default behavior. It is a fact, however, that this setting improves performance (at least on my own dataset, which I don't think should affect the running time much).

Example initialization:

import language_tool_python

language_tool = language_tool_python.LanguageTool('en-US', config={'maxSpellingSuggestions': 1})
一刻暧昧 2025-02-11 08:36:13


If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores in your CPU, which I am assuming you have, instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more about Dask here or see an example here.
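A minimal sketch of the Dask approach, assuming the question's df with a 'body' column; the partition count of 8 and creating one LanguageTool per partition are illustrative choices:

import dask.dataframe as dd

import language_tool_python

def count_errors_in_partition(partition):
    # One tool (and one local server) per partition, created on the worker.
    tool = language_tool_python.LanguageTool('en-US')
    try:
        return partition['body'].apply(lambda text: len(tool.check(text)))
    finally:
        tool.close()

ddf = dd.from_pandas(df, npartitions=8)  # df is the 3-million-row frame
df['body_num_errors'] = ddf.map_partitions(
    count_errors_in_partition,
    meta=('body_num_errors', 'int64'),
).compute()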

怀中猫帐中妖 2025-02-11 08:36:13


Make sure to create an instance of the language tool only once.
Then, for each row, call a method (or function, depending on your code pattern) that includes the rest of the code logic:

 matches = tool.check(text)
 len(matches)
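Put together, a minimal sketch of that pattern, again assuming the question's df with a 'body' column:

import language_tool_python

# Instantiate once; this is the slow 10-20 second step.
tool = language_tool_python.LanguageTool('en-US')

def num_errors(text):
    # Each call is just a quick request to the already-running local server.
    matches = tool.check(text)
    return len(matches)

df['body_num_errors'] = df['body'].apply(num_errors)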