Java library for getting the important words in a title

Posted on 2024-12-29 13:05:39

Is there a Java library that, given a text (a title), returns the collection of important words in it?
EDIT: By important I mean the words that define the main idea of the sentence.
Thank you.


Comments (3)

谈下烟灰 2025-01-05 13:05:39

You might want to take a look at Apache Mahout.

You might also want to read more about the tf-idf model, which is often used for cases similar to the one you describe.

EDIT: more info on the tf-idf model:

The tf-idf model basically says two things:

  1. If a term appears many times in your data, it is probably important. [tf]
  2. If a term appears many times in the world, an appearance of it in your data is expected. However, if it is rare in the world and still appears in your data, that indicates it is very "important". [idf]

The tf-idf model uses these assumptions and gives each term a rating according to its tf and idf values.

To find the idf value, you might want to index your collection, or use some search engine API and estimate how common each term is from the number of results. [Note that the number returned by the engine is not exact, but it can serve as a rough estimate.]
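
As a rough illustration of the two points above, here is a minimal sketch in plain Java. It assumes you have a small reference collection of titles to stand in for "the world". The class `TfIdfSketch`, its methods, the sample titles, and the smoothed idf formula `log((N + 1) / (df + 1))` are all assumptions made for this example; this is not how Apache Mahout exposes tf-idf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Apache Mahout or any library.
class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Score each term of `title` against a reference collection of titles.
    static Map<String, Double> score(String title, List<String> corpus) {
        // Document frequency: in how many corpus titles does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Term frequency: how often does each term occur in the title itself?
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokenize(title)) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docsWithTerm = df.getOrDefault(e.getKey(), 0);
            // Smoothed idf (one variant among several, chosen here as an assumption):
            // terms found in every corpus title get 0, unseen terms get the highest value.
            double idf = Math.log((corpus.size() + 1.0) / (docsWithTerm + 1.0));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "The history of the Roman empire",
            "The art of the deal",
            "A guide to the galaxy");
        score("The fall of the Roman empire", corpus)
            .forEach((term, s) -> System.out.printf("%s %.3f%n", term, s));
    }
}
```

With only a handful of reference titles the scores are crude, but the ordering already reflects the idea: a word that appears in every reference title scores zero, while a word that appears nowhere else scores highest.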

以为你会在 2025-01-05 13:05:39

Topic models try to do this for documents (or collections of documents). I doubt you can do much with individual sentences.

You could try using a semantic parser (e.g. RelEx) to get the main subject, object, and so on, but it's still not that straightforward.

Some examples of what you are trying to do would help. "Define the main idea" is still pretty vague. What type of sentences are you dealing with?

玻璃人 2025-01-05 13:05:39

Considering you are working exclusively with titles, I would imagine pretty much any word that is not a stop word is important.

Perhaps you are just looking for a basic stop word removal algorithm, rather than a full-blown text analysis algorithm?

It just depends on how complex or "smart" you need this thing to be.
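
For what it's worth, basic stop word removal can be sketched in a few lines of plain Java. The class `StopWordFilter` and its tiny hand-picked stop-word list are assumptions for illustration only; a real application would likely load a fuller list for its language.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; the stop-word list is a small hand-picked sample.
class StopWordFilter {

    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "by", "for", "from", "in", "is",
        "it", "of", "on", "or", "that", "the", "to", "with"));

    // Return the words of the title, lowercased and in order, minus stop words.
    static List<String> importantWords(String title) {
        List<String> kept = new ArrayList<>();
        for (String word : title.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(importantWords("The fall of the Roman empire"));
    }
}
```

Everything that survives the filter is treated as equally important here, which is the trade-off of this approach compared with a weighted scheme.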
