Is the Lucene Porter Stemmer thread safe?

Posted on 2024-12-05 06:17:15

Quick question: is the Porter stemmer from the Lucene packages (Java) thread safe?

I'm guessing the answer is no, since you need to set the current string, invoke the stem method, and then read the current string back to get the stemmed word. But perhaps I'm missing something: is there a thread-safe way to stem a single word or string with Lucene?

Does anyone know from experience whether it is faster to instantiate one Porter stemmer, wrap it in a synchronized block, and run the setCurrent("..."); stem(); getCurrent(); routine on it, or whether it is simply faster to create a new Porter stemmer instance for each string/document you want to process?

In this case I have many thousands of documents, each of which is picked up by a pool of threads (i.e. one thread has one document).

Edit, FYI: example usage pattern:

import org.tartarus.snowball.ext.PorterStemmer;
...
private String stem(String word) {
    PorterStemmer stem = new PorterStemmer();
    stem.setCurrent(word);
    stem.stem();
    return stem.getCurrent();
}

Cheers!


半枫 2024-12-12 06:17:15

Looking at the docs, it seems the PorterStemmer class is not re-entrant, so I'd build one instance per thread if I were you. If stemming is one of the main things your program does, and it has no other way of keeping your CPU cores busy, then a synchronized block seems like a bad idea: the threads would spend most of their time blocked, waiting for the stemmer to finish another thread's document. I wouldn't create a thread per document, either; a thread pool with one thread per core might be a wiser choice.

(No example code since I couldn't even figure out the usage from the API docs. RTFS to find out how this thing works...)
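
For what it's worth, the per-thread idea above might be sketched roughly as follows. This is only an illustration under assumptions: `ToyStemmer` is a hypothetical stand-in that copies the setCurrent/stem/getCurrent shape from the question's snippet so the sketch compiles without Lucene on the classpath, and it does not implement the Porter algorithm. `PerThreadStemming` and `stemAll` are likewise made-up names. With Lucene available, one would presumably swap `ToyStemmer` for `org.tartarus.snowball.ext.PorterStemmer`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for PorterStemmer: same stateful call pattern,
// but only a crude placeholder rule (NOT the real Porter algorithm).
class ToyStemmer {
    private String current = "";

    void setCurrent(String word) { current = word; }

    // Placeholder: strip one trailing "s" from words longer than 3 characters.
    boolean stem() {
        if (current.length() > 3 && current.endsWith("s")) {
            current = current.substring(0, current.length() - 1);
            return true;
        }
        return false;
    }

    String getCurrent() { return current; }
}

class PerThreadStemming {
    // Each worker thread lazily gets its own stemmer, so the mutable
    // "current" state is never shared and no synchronized block is needed.
    private static final ThreadLocal<ToyStemmer> STEMMER =
            ThreadLocal.withInitial(ToyStemmer::new);

    static String stem(String word) {
        ToyStemmer s = STEMMER.get();
        s.setCurrent(word);
        s.stem();
        return s.getCurrent();
    }

    // Stems every document (a list of words) on a pool with one thread per core.
    static List<List<String>> stemAll(List<List<String>> documents) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> doc : documents) {
                Callable<List<String>> task = () -> {
                    List<String> out = new ArrayList<>();
                    for (String w : doc) {
                        out.add(stem(w));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in submission order.
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The `ThreadLocal` keeps the number of stemmer instances equal to the number of pool threads rather than the number of words or documents. Whether that actually beats creating a fresh instance per call, as in the question's snippet, would need to be measured on the real workload.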
