Speeding up OpenNLP POS tagging when applying it to many texts

Asked 2024-10-06 18:06:38

I'm currently working on a keyphrase extraction tool which should provide tag suggestions for texts or documents on a website. Since I am following the method proposed in the paper A New Approach to Keyphrase Extraction Using Neural Networks, I am using the OpenNLP toolkit's POSTagger for the first step, i.e. candidate selection.

In general, the keyphrase extraction works pretty well. My problem is that I have to repeat the expensive loading of the models from their corresponding files every time I want to use the POSTagger:

// loads both model files from disk: the expensive, per-request step
posTagger = new POSTaggerME(new POSModel(new FileInputStream(new File(modelDir + "/en-pos-maxent.bin"))));
tokenizer = new TokenizerME(new TokenizerModel(new FileInputStream(new File(modelDir + "/en-token.bin"))));
// ...
String[] tokens = tokenizer.tokenize(text);
String[] tags = posTagger.tag(tokens);

This is because the code does not live in the webserver itself but in a "handler" whose lifecycle covers only the processing of one specific request. My question is: how can I load the model files only once? (I don't want to spend 10 seconds waiting for the models to load and then use them for only 200 ms.)

My first idea was to serialize the POSTaggerME (and the TokenizerME, respectively) and deserialize it every time I need it using Java's built-in mechanism. Unfortunately this doesn't work: it raises an exception. (I do serialize the classifier from the WEKA toolkit, which classifies my candidates at the end, so that I don't have to build (or train) the classifier every time. I therefore thought the same approach might work for the POSTaggerME as well. Unfortunately, it does not.)
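
For reference, a minimal sketch of the failing attempt (the file name tagger.ser is illustrative; the exception is presumably a NotSerializableException, since POSTaggerME does not implement java.io.Serializable):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical attempt at Java built-in serialization: writeObject fails
// because POSTaggerME is not java.io.Serializable.
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("tagger.ser"))) {
    out.writeObject(posTagger); // throws java.io.NotSerializableException
}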

In the case of the Tokenizer I could fall back on the simple WhitespaceTokenizer, which is an inferior solution but not a bad one at all:

tokenizer = WhitespaceTokenizer.INSTANCE;

But I don't see a comparable option for a reliable POSTagger.

Comments (1)

三五鸿雁 2024-10-13 18:06:38

Just wrap your tokenization/POS-tagging pipeline in a singleton.

If the underlying OpenNLP code isn't thread-safe, put the calls in synchronized blocks, e.g.:

// the singleton's tokenization/POS-tagging pipeline
String[] tokens;
synchronized (tokenizer) {
    tokens = tokenizer.tokenize(text);
}
String[] tags;
synchronized (posTagger) {
    tags = posTagger.tag(tokens);
}
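
For illustration, here is a minimal sketch of such a singleton holder (the class name NlpPipeline, the method names, and the coarse per-call synchronization are assumptions, not part of the original answer):

import java.io.FileInputStream;
import java.io.IOException;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Illustrative singleton: loads the models once, on first access,
// and shares them across all request handlers.
public final class NlpPipeline {

    private static NlpPipeline instance;

    private final TokenizerME tokenizer;
    private final POSTaggerME posTagger;

    private NlpPipeline(String modelDir) throws IOException {
        try (FileInputStream tokenStream = new FileInputStream(modelDir + "/en-token.bin");
             FileInputStream posStream = new FileInputStream(modelDir + "/en-pos-maxent.bin")) {
            tokenizer = new TokenizerME(new TokenizerModel(tokenStream));
            posTagger = new POSTaggerME(new POSModel(posStream));
        }
    }

    // Synchronized lazy initialization: the expensive load happens only once.
    public static synchronized NlpPipeline getInstance(String modelDir) throws IOException {
        if (instance == null) {
            instance = new NlpPipeline(modelDir);
        }
        return instance;
    }

    // Synchronized because TokenizerME and POSTaggerME are not thread-safe.
    public synchronized String[] tag(String text) {
        String[] tokens = tokenizer.tokenize(text);
        return posTagger.tag(tokens);
    }
}

Each handler would then call NlpPipeline.getInstance(modelDir).tag(text) and pay the model-loading cost only on the very first request.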