Which NLP toolkit should I use in Java?
I'm working on a project that consists of a website that connects to the NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results.
I'm using the Java language for the text mining, and AJAX with ICEfaces for the development of the website.
What I have:
A list of articles returned from a search.
Each article has an ID and an abstract.
The idea is to get keywords from each abstract text.
Then compare all the keywords from all abstracts and find the ones that are repeated most often, so the website can show the related words for the search.
Any ideas?
I searched a lot on the web, and I know there is Named Entity Recognition, Part-of-Speech tagging, and the GENIA thesaurus for NER on genes and proteins. I have already tried stemming, stop-word lists, etc.
I just need to know the best approach to solve this problem.
Thanks a lot.
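The pipeline described above (tokenize each abstract, drop stop words, count keyword frequencies across all abstracts, show the top N) can be sketched in plain Java. This is a minimal illustration with a tiny hard-coded stop-word list; a real system would use a full stop-word list and possibly stemming or POS filtering on top.

```java
import java.util.*;
import java.util.stream.*;

public class KeywordExtractor {
    // Tiny stop-word list for illustration only; a real list would be much longer.
    private static final Set<String> STOP_WORDS = Set.of(
            "the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with");

    /** Counts how often each non-stop-word token appears across all abstracts. */
    public static Map<String, Integer> countKeywords(List<String> abstracts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String abs : abstracts) {
            for (String token : abs.toLowerCase().split("[^a-z0-9]+")) {
                // Skip very short tokens and stop words.
                if (token.length() > 2 && !STOP_WORDS.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the n most frequent keywords, most frequent first. */
    public static List<String> topKeywords(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> abstracts = List.of(
                "The BRCA1 gene is linked to breast cancer.",
                "Mutations in BRCA1 increase cancer risk.");
        System.out.println(topKeywords(countKeywords(abstracts), 3));
    }
}
```

In the two sample abstracts, `brca1` and `cancer` each appear twice, so they surface at the top of the list.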
I would recommend you use a combination of POS tagging and string tokenizing to extract all the nouns out of each abstract, then use some sort of dictionary/hash to count the frequency of each of these nouns, and then output the N most prolific nouns. Combining that with some other intelligent filtering mechanism should do reasonably well in giving you the important keywords from the abstracts.
For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
However, if you are expecting a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most prolific n-grams for n = 2 to 4.
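The n-gram idea can be sketched in plain Java without any external library: slide a window of size n over the token stream for each n in the range and count each n-gram. This is a simplified illustration using whitespace tokenization; a real pipeline would tokenize and normalize the text more carefully first.

```java
import java.util.*;

public class NGramCounter {
    /** Counts all n-grams for n = minN..maxN over a whitespace-tokenized text. */
    public static Map<String, Integer> countNGrams(String text, int minN, int maxN) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of n tokens across the text.
            for (int i = 0; i + n <= tokens.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> grams =
                countNGrams("breast cancer gene breast cancer risk", 2, 4);
        System.out.println(grams.get("breast cancer")); // prints 2
    }
}
```

Taking the most frequent n-grams across all abstracts then picks up multi-word terms like "breast cancer" that noun-only extraction would split apart.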
There's an Apache project for that... I haven't used it, but OpenNLP is an open-source Apache project. It's in the incubator, so it may be a bit raw.
This post from Jeff's search engine cafe has a number of other suggestions.
This might be relevant as well:
https://github.com/jdf/cue.language
It has stop words, word and n-gram frequencies, ...
It's part of the software behind Wordle.
I ended up using Alias-i's LingPipe.