Java library for getting the important words in a title
Is there a Java library that, given a text (a title), returns the collection of important words in it?
EDIT: By "important" I mean the words that define the main idea of the sentence.
Thank you.
Comments (3)
You might want to take a look at Apache Mahout.
You might also want to read more about the tf-idf model, which is often used for cases similar to the one you describe.
EDIT: More information on the tf-idf model:
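The tf-idf model basically says two things:
1. A term that occurs often in a document is probably more important to that document than a term that occurs rarely (tf, term frequency).
2. A term that is rare across the whole collection is probably more informative than a term that appears in most documents (idf, inverse document frequency).
The tf-idf model uses these assumptions to give each term a rating based on its tf and idf values.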
To find the idf value, you might want to index your collection, or use a search engine API and estimate how common each term is from the number of results. (Note that the number returned by the engine is not exact, but it can serve as a rough estimate.)
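As a rough illustration of rating terms by tf and idf, here is a minimal sketch in plain Java. It does not use Apache Mahout or any other library: the class `TfIdfSketch`, its methods, and the sample titles are hypothetical names made up for this example, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` is just one common variant, chosen here so that terms absent from the collection do not cause a division by zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of any library.
final class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Score each distinct term of `title` against a reference collection of titles.
    static Map<String, Double> score(String title, List<String> collection) {
        // Document frequency: in how many collection titles does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String doc : collection) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency: how often does each term occur in the title itself?
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokenize(title)) {
            tf.merge(term, 1, Integer::sum);
        }

        int n = collection.size();
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 0);
            // Smoothed idf (an assumption of this sketch, see the text above).
            double idf = Math.log((1.0 + n) / (1.0 + d)) + 1.0;
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList(
                "the quick brown fox",
                "the lazy dog",
                "the java library");
        // Terms that are rare in the collection receive higher scores.
        System.out.println(score("the java parser", collection));
    }
}
```

With these sample titles, "parser" (absent from the collection) scores highest, "java" (in one title) comes next, and "the" (in every title) scores lowest. For a real application the collection would be your own corpus of titles, and with very short texts the tf part contributes little because most words occur only once.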
Topic models try to do this for documents (or collections of documents). I doubt you can do much with individual sentences.
You could try a semantic parser (e.g. RelEx) to extract the main subject, object, and so on, but it's still not that straightforward.
Some examples of what you are trying to do would help. "Define the main idea" is still pretty vague: what type of sentences are you dealing with?
Considering you are working exclusively with titles, I would imagine pretty much any word that is not a stop word is important.
Perhaps you are just looking for a basic stop word removal algorithm, rather than a full-blown text analysis algorithm?
It just depends on how complex or "smart" you need this thing to be.
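For the simple end of that spectrum, a basic stop word filter might look like the sketch below. The class name `StopWordFilterSketch` and the tiny stop list are assumptions made up for illustration; a real stop list would be considerably longer and language-specific.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of any library.
final class StopWordFilterSketch {

    // Tiny illustrative English stop list (an assumption of this sketch).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "in", "on", "for", "to",
            "and", "or", "is", "are", "with", "by", "at", "from"));

    // Return the words of the title, lowercased, in order, minus stop words.
    static List<String> importantWords(String title) {
        List<String> kept = new ArrayList<>();
        // Split on anything that is not a letter or digit.
        for (String token : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(importantWords("Getting the important words of a title in Java"));
        // prints [getting, important, words, title, java]
    }
}
```

A set lookup keeps the filter linear in the number of words. Anything smarter (stemming, part-of-speech filtering, weighting) would build on top of this step.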