Are there APIs for text analysis/mining in Java?

Posted 2024-11-26 04:45:39 · 263 words · 4 views · 0 comments

I want to know whether there is a Java API for text analysis: something that can extract all the words in a text, individual words, expressions, and so on, and that can tell whether a word it finds is a number, date, year, name, currency, etc.

I'm just starting out with text analysis, so I only need an API to get going. I've written a web crawler, and now I need something to analyze the downloaded data: methods to count the words on a page, find similar words, detect data types, and handle other text-related tasks.

Are there APIs for text analysis in Java?

EDIT: To be clear, I mean text mining. I want to mine the text, and I need a Java API that provides this.

Comments (5)

ゞ花落谁相伴 2024-12-03 04:45:39

It sounds like you're looking for a named entity recogniser (NER).

You have a couple of choices.

CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a named entity recogniser.

GATE (General Architecture for Text Engineering) is an open-source suite for language processing. Take a look at the screenshots on the developer page: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do. The video tutorial gives a better overview of what the software has to offer.

You may need to customise one of them to fit your needs.

You also have other options:


As for training CRFClassifier, there is a brief explanation in their FAQ:

...the training data should be in tab-separated columns, and you
define the meaning of those columns via a map. One column should be
called "answer" and has the NER class, and existing features know
about names like "word" and "tag". You define the data file, the map,
and what features to generate via a properties file. There is
considerable documentation of what features different properties
generate in the Javadoc of NERFeatureFactory, though ultimately you
have to go to the source code to answer some questions...

You can also find usage examples in the Javadoc of CRFClassifier:

Typical command-line usage

For running a trained model with a provided serialized classifier on a
text file:

java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt

When specifying all parameters in a properties file (train, test, or
runtime):

java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile

To train and test a simple NER model from the command line:

java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output

甜心 2024-12-03 04:45:39

For example, you could use classes from the standard library package java.text, or use StreamTokenizer (which you can customize to your requirements). But text from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited for that.
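As a rough illustration of the java.text route, here is a minimal sketch that counts words with BreakIterator. The class name WordCounter and the choice to lower-case tokens and skip punctuation are assumptions made for this example, not something the answer prescribes.

```java
import java.text.BreakIterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts word occurrences in a piece of text.
class WordCounter {
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // The iterator also yields whitespace and punctuation segments;
            // keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                counts.merge(segment.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The crawler fetched the page. The page was large."));
    }
}
```

This handles clean text reasonably well, but it does nothing about misspellings, which is the limitation the answer points out.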

So I'd advise you to use regular expressions (java.util.regex) and build your own tokenizer to suit your needs.
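A minimal sketch of such a regex-based tokenizer follows. The token types (DATE, CURRENCY, NUMBER, WORD), the patterns themselves and the class name RegexTokenizer are illustrative assumptions; real crawled text would need broader patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer: splits text into typed tokens with one combined regex.
class RegexTokenizer {
    // Alternatives are tried left to right, so the more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<DATE>\\d{4}-\\d{2}-\\d{2})"          // e.g. 2011-07-23
        + "|(?<CURRENCY>\\p{Sc}\\d+(?:\\.\\d{2})?)" // currency symbol followed by an amount
        + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"            // integer or decimal
        + "|(?<WORD>\\p{L}+)");                     // run of letters

    private static final String[] TYPES = {"DATE", "CURRENCY", "NUMBER", "WORD"};

    // Returns a list of {type, text} pairs in order of appearance.
    static List<String[]> tokenize(String text) {
        List<String[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(new String[] {type, m.group()});
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String[] t : tokenize("Paid $19.99 on 2011-07-23 for 3 books")) {
            System.out.println(t[0] + "\t" + t[1]);
        }
    }
}
```

Keeping every pattern in one alternation means the order of the alternatives decides ambiguous cases, so that order is the first thing to adjust when adding new token types.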

P.S.
Depending on your needs, you could write a state-machine parser to recognize templated parts of raw text. The picture below shows a simple state-machine recognizer (you can build more advanced parsers that recognize much more complex templates in text).

[Image: simple state-machine recognizer]
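Since the original diagram is not available, here is a minimal sketch of what a state-machine recognizer can look like in Java. It is not the machine from the picture: the states, the class name NumberRecognizer and the template it accepts (digits with an optional fractional part) are assumptions chosen to keep the example small.

```java
// Hypothetical recognizer: accepts tokens such as "42" or "3.14".
class NumberRecognizer {
    enum State { START, INTEGER, DOT, FRACTION, REJECT }

    // Transition function: the next state for a given state and input character.
    static State step(State state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:    return digit ? State.INTEGER : State.REJECT;
            case INTEGER:  return digit ? State.INTEGER : (c == '.' ? State.DOT : State.REJECT);
            case DOT:      return digit ? State.FRACTION : State.REJECT;
            case FRACTION: return digit ? State.FRACTION : State.REJECT;
            default:       return State.REJECT;
        }
    }

    // A token is accepted only if the machine ends in INTEGER or FRACTION.
    static boolean accepts(String token) {
        State state = State.START;
        for (int i = 0; i < token.length() && state != State.REJECT; i++) {
            state = step(state, token.charAt(i));
        }
        return state == State.INTEGER || state == State.FRACTION;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"42", "3.14", "3.", "1.2.3"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

More complex templates follow the same shape: add states and transitions, and decide which states count as accepting.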

泪痕残 2024-12-03 04:45:39

If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.

Otherwise, it might be easiest to write your own analyzer class that leans heavily on the standard Pattern class (java.util.regex.Pattern). That way, you control what text counts as a word, boundary, number, date, etc. For example, is 20110723 a date or a number? You might need to implement a multi-pass parsing algorithm to better "understand" the data.
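One way to approach the 20110723 question is a two-pass check, sketched below. The first pass looks only at the token's shape with Pattern; the second tries to read an 8-digit token as a calendar date. The class name TokenClassifier, the labels and the 1900 to 2100 year window are assumptions for illustration; whether such a token should count as a date depends on your data.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical two-pass classifier for single tokens.
class TokenClassifier {
    private static final Pattern WORD = Pattern.compile("\\p{L}+");
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    static String classify(String token) {
        // Pass 1: coarse shape of the token.
        if (WORD.matcher(token).matches()) return "WORD";
        if (!DIGITS.matcher(token).matches()) return "OTHER";

        // Pass 2: an 8-digit token might be a yyyyMMdd date.
        if (token.length() == 8) {
            try {
                LocalDate date = LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE);
                // Assumed heuristic: only years in a plausible window count as dates.
                if (date.getYear() >= 1900 && date.getYear() <= 2100) return "DATE";
            } catch (DateTimeParseException e) {
                // Not a valid calendar date (e.g. month 13); fall through to NUMBER.
            }
        }
        return "NUMBER";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"20110723", "20111345", "42", "hello"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

A further pass could use the surrounding tokens (for example, a preceding "on" or "since") to settle cases this heuristic gets wrong.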

哆兒滾 2024-12-03 04:45:39

I recommend looking at LingPipe too. If you are OK with web services, this article has a good summary of the different APIs.

深爱不及久伴 2024-12-03 04:45:39

I'd adapt Lucene's Analysis and Stemmer classes rather than reinvent the wheel. They cover the vast majority of cases. See also the additional and contrib classes.
