Efficient keyword detection/extraction with a predefined set of keywords
How can I efficiently extract keywords, with a relevance score, from a string? My list of keywords is predefined. For example, in an article about Michelle Obama that also mentions Barack Obama, I want to extract both Michelle Obama and Barack Obama, with the keyword Michelle Obama getting the higher relevance value (both Michelle Obama and Barack Obama are present in my keyword list).
Checking the string for the number of occurrences of each keyword doesn't seem very efficient. My application is developed in PHP, but any language is OK if I can do this efficiently.
I tried OpenCalais, but it is not detecting most of my keywords. Is it possible to extract keywords using Lucene?
Comments (1)
The Apache Lucene package should suit you. However, if you have a title and paragraphs, you can filter out the stop words, give a higher rank to the words in the title, and then match them (or their inflected forms) in the paragraphs. You can also consult some text summarization articles if you want to program this yourself.
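
The title-weighting idea can be sketched in a few lines. This is only a rough illustration in Python (the question allows any language), not a Lucene example. The stop-word list, the weights `TITLE_WEIGHT` and `BODY_WEIGHT`, and all function names are assumptions made up for this sketch. Matching inflected forms (stemming) is left out. The text is scanned once and each short run of tokens is looked up in a dictionary, so the cost grows with the text length and the longest keyword, not with the number of keywords.

```python
import re
from collections import Counter

# Hypothetical stop-word list and weights; tune both for real data.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "with"}
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words.

    Caveat: dropping stop words makes the words around them adjacent,
    so "Michelle and Obama" would also match the keyword "Michelle Obama".
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(keywords):
    """Map each keyword's token tuple back to its original spelling."""
    index = {}
    for keyword in keywords:
        tokens = tuple(tokenize(keyword))
        if tokens:
            index[tokens] = keyword
    return index


def count_matches(tokens, index, max_len):
    """Scan the tokens once, looking up every run of up to max_len tokens."""
    counts = Counter()
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length > len(tokens):
                break
            keyword = index.get(tuple(tokens[start:start + length]))
            if keyword is not None:
                counts[keyword] += 1
    return counts


def rank_keywords(title, body, keywords):
    """Return (keyword, score) pairs, highest score first.

    Keywords that never occur are omitted from the result.
    """
    index = build_index(keywords)
    max_len = max((len(tokens) for tokens in index), default=0)
    scores = Counter()
    for text, weight in ((title, TITLE_WEIGHT), (body, BODY_WEIGHT)):
        for keyword, hits in count_matches(tokenize(text), index, max_len).items():
            scores[keyword] += weight * hits
    return scores.most_common()


keywords = ["Michelle Obama", "Barack Obama", "White House"]
title = "Michelle Obama launches a new initiative"
body = ("Michelle Obama spoke on Tuesday. Barack Obama also attended. "
        "Michelle Obama thanked the volunteers.")
print(rank_keywords(title, body, keywords))
# [('Michelle Obama', 5.0), ('Barack Obama', 1.0)]
```

With these made-up weights, one title hit counts as much as three body hits, which is why Michelle Obama ends up ranked above Barack Obama in the sample.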