A pure statistical engine, or a natural language processing engine?
What statistical engines, if any, yield better results than the OpenNLP suite of tools? What I'm looking for is an engine that picks keywords from texts and stems those verbs and nouns; perhaps Natural Language Processing is not the way to go here. The engine should also work with different languages.
4 Answers
You're probably looking for the Snowball project, which has developed stemmers for a number of different languages.
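A minimal sketch of what Snowball provides, here via NLTK's `SnowballStemmer` wrapper (this assumes NLTK is installed; the Snowball stemmers themselves are pure algorithms and need no extra data downloads):

```python
from nltk.stem import SnowballStemmer

# Snowball ships stemmers for many languages; each is selected by name.
print(SnowballStemmer.languages)

en = SnowballStemmer("english")
de = SnowballStemmer("german")

print(en.stem("running"))    # -> run
print(en.stem("generously"))
print(de.stem("laufenden"))
```

Each stemmer is independent, so supporting a new language is just a matter of instantiating the stemmer for it.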
As a complete NLP tool, LingPipe is probably worth a look.
However, if all you need to do is find verbs and nouns and stem them, then you could just
1) tokenize the text
2) run a POS tagger
3) run a stemmer
The Stanford tools can do this for multiple languages I believe, and NLTK would be a quick way to try it out.
However, be careful about going after only verbs and nouns: what do you do about noun phrases and multiword nouns? Ideally an NLP package can handle this, but a lot depends on the domain you are working in. Unfortunately, much of NLP comes down to how good your data is.
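The three steps above can be sketched end to end. This is a deliberately toy, self-contained illustration (a regex tokenizer, a tiny hard-coded lexicon in place of a trained tagger, and a naive suffix stripper in place of a real stemmer); in practice you would swap in NLTK or the Stanford tools for each stage:

```python
import re

# 1) Tokenize: pull out alphabetic tokens (toy stand-in for a real tokenizer).
def tokenize(text):
    return re.findall(r"[A-Za-z]+", text)

# 2) POS-tag: a tiny hard-coded lexicon stands in for a trained tagger.
LEXICON = {
    "engine": "NOUN", "engines": "NOUN", "keywords": "NOUN", "texts": "NOUN",
    "picks": "VERB", "provides": "VERB",
}
def pos_tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "OTHER")) for tok in tokens]

# 3) Stem: naive suffix stripping (a real system would use Snowball/Porter).
def stem(word):
    for suffix in ("ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def keywords(text):
    # Keep only nouns and verbs, stemmed -- the pipeline described above.
    return [stem(tok.lower()) for tok, tag in pos_tag(tokenize(text))
            if tag in ("NOUN", "VERB")]

print(keywords("The engine picks keywords from texts"))
# -> ['engine', 'pick', 'keyword', 'text']
```

The toy lexicon and suffix list are placeholders; the point is only that the three stages compose into a simple keyword extractor once each stage is backed by a real implementation.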
If you're looking for Java code, I can recommend Stanford's set of tools. Their POS tagger works for English, German, Chinese and Arabic (though I only used it for English) and includes an (English-only) lemmatizer.
These tools are all free, accuracy is pretty high and the speed is not too bad for a Java-based solution; the main problems are sometimes flaky APIs and high memory use.
I had good experience with TreeTagger:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
It's easy to use, faster than Stanford's, and is among the "good" stemmers/taggers out there. It performs all the operations at once: tokenization, stemming, and tagging.