Java keyword extraction

Posted on 2024-11-05 07:27:11

Is there a simple-to-use Java library that can take a String and return a set of Strings which are the keywords/keyphrases?

It doesn't have to be particularly clever, just use stop words and stemming to match keywords.

I am looking at the KEA package http://code.google.com/p/kea-algorithm/ but I can't figure out how to use their code.

Ideally something simple which has a little example documentation would be good. In the meantime I will set about writing this myself!

EDIT: When I say I can't figure out how to use their code, I mean I can't see a simple way. The individual classes by themselves have useful methods that will do much of the work.

Comments (2)

┾廆蒐ゝ 2024-11-12 07:27:11

This is a fairly old question and probably the OP has already solved his problem, but putting it here for others who may stumble upon the question looking for how to use KEA.

For KEA, you will need a training set - some of your documents will need to have keywords already set. The training data consists of a directory of documents (.txt files) and corresponding keywords files (.key files), with one keyword per line. You train KEA on this set, then use the model to extract keywords on the rest of your documents, which are in another directory of .txt files. KEA will write out corresponding .key files in this directory.
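The directory layout described above can be prepared with a few lines of plain Java. This is only a sketch of the file layout (one `.txt` per document plus a matching `.key` file with one keyword per line); it does not call KEA itself. The class name `KeaLayoutSketch`, the helper names, the temp-directory location, and the UTF-8 encoding are all assumptions made for illustration, so check the KEA README for the exact format it expects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of KEA: it only writes and reads the
// .txt/.key file pairs described in the answer.
class KeaLayoutSketch {

    // Writes <name>.txt (document text) and <name>.key (one keyword per line).
    static void writeDocument(Path dir, String name, String text, List<String> keywords) {
        try {
            Files.createDirectories(dir);
            Files.writeString(dir.resolve(name + ".txt"), text);   // UTF-8 by default (assumption)
            Files.write(dir.resolve(name + ".key"), keywords);     // one list element per line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a .key file back, e.g. one produced by an extraction run.
    static List<String> readKeywords(Path dir, String name) {
        try {
            return Files.readAllLines(dir.resolve(name + ".key"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative location only; any directory would do.
        Path train = Path.of(System.getProperty("java.io.tmpdir"), "kea-train-sketch");
        writeDocument(train, "doc1",
                "Keyword extraction picks the terms that best describe a document.",
                List.of("keyword extraction", "document"));
        System.out.println(readKeywords(train, "doc1"));
    }
}
```

The same `readKeywords` helper could be pointed at the second directory after extraction to collect whatever `.key` files were written there.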

For more information, take a look at one or more of the following:

1) The KEA source distribution has a TestKEA.java class which shows how to extract keywords from a small test corpus. The README has details on the directory format required.

2) This blog post has (somewhat terse, IMO) instructions on how to use KEA.

http://kea-pranay.blogspot.com/2010/02/kea-key-extraction-algorithm.html

3) My blog post, which I wrote up last weekend while trying to learn how to generate keywords from a corpus I had (which was already manually annotated with keywords). It has Python code to pre-process data into the format KEA expects, Scala code to train and run the extractor (KEA provides a Java API), and Python code to analyze and visualize the generated keywords.

http://sujitpal.blogspot.com/2014/08/keyword-extraction-with-kea.html

思慕 2024-11-12 07:27:11

You might try the Porter Stemming algorithm: the Java version is at http://tartarus.org/~martin/PorterStemmer/java.txt and the main page is at http://tartarus.org/~martin/PorterStemmer/. It's old, but it doesn't do a bad job.
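For the "stop words plus stemming" approach the question asks about, a minimal self-contained sketch might look like the following. Everything here is illustrative: the class `KeywordSketch`, the tiny stop-word list, and `naiveStem` are assumptions, and `naiveStem` is a crude suffix stripper, not the Porter algorithm. The stemmer is passed in as a function so that a real implementation, such as the Porter stemmer linked above, could be swapped in.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch: frequency-ranked keywords after stop-word removal and stemming.
class KeywordSketch {

    // Tiny illustrative stop-word list; a real list would be much longer.
    static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "and", "are", "for", "in", "is", "it",
            "of", "on", "that", "the", "this", "to", "with");

    // Placeholder stemmer (NOT Porter): strips a few common English suffixes.
    static String naiveStem(String word) {
        for (String suffix : List.of("ing", "es", "s")) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Returns up to `limit` stems, most frequent first (ties broken alphabetically).
    static List<String> extract(String text, int limit, UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // Skip empty/very short tokens and stop words.
            if (token.length() < 3 || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(stemmer.apply(token), 1, Integer::sum);
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "Cats chase cats. The cat sleeps.";
        System.out.println(extract(text, 2, KeywordSketch::naiveStem)); // prints [cat, chase]
    }
}
```

Note that this returns stems rather than original surface forms, and it only handles single words; keyphrases would need n-gram counting on top.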
