Is there an API for text analysis/mining in Java?
I want to know if there is an API for text analysis in Java: something that can extract all the words in a text, separate words, expressions, etc., and that can tell whether a word it finds is a number, date, year, name, currency, etc.
I'm just starting out with text analysis, so I only need an API to get going. I wrote a web crawler, and now I need something to analyze the downloaded data. I need methods to count the words in a page, find similar words, detect data types, and other text-related features.
Are there APIs for text analysis in Java?
EDIT: To be clear, I mean text mining. I want to mine the text, and I'm looking for a Java API that provides this.
Comments (5)
It looks like you're looking for a Named Entity Recogniser.
You have a couple of choices.
CRFClassifier, from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering) is an open source suite for language processing. Take a look at the screenshots on the developer page: http://gate.ac.uk/family/developer.html. They should give you a brief idea of what it can do. The video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
As for training CRFClassifier, you can find a brief explanation in their FAQ:
You can also find a code snippet in the Javadoc of CRFClassifier:
For example, you could use some classes from the standard library's java.text package, or use StreamTokenizer (which you can customize to your requirements). But as you know, text from internet sources usually contains many spelling mistakes, and for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited for that. So I'd advise you to use regular expressions (java.util.regex) and write your own kind of tokenizer according to your needs.
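A minimal sketch of such a regex-based tokenizer, using only java.util.regex. The class name RegexTokenizer, the five token types, and the patterns themselves (ISO-style dates, years from 1000 to 2099, a currency symbol followed by digits) are assumptions chosen for illustration, not something from the answer above. It does not attempt fuzzy matching of misspelled words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizer {
    // Token types, in the same order as the alternatives in TOKEN (hypothetical names).
    private static final String[] TYPES = {"DATE", "CURRENCY", "YEAR", "NUMBER", "WORD"};

    // Alternatives are tried left to right, so the more specific patterns come first.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<DATE>\\d{4}-\\d{2}-\\d{2})"               // e.g. 2011-07-23
        + "|(?<CURRENCY>\\p{Sc}\\d+(?:\\.\\d{2})?)"     // currency symbol + amount
        + "|(?<YEAR>(?:1[0-9]|20)\\d{2}(?!\\.?\\d))"    // 1000-2099, not part of a longer number
        + "|(?<NUMBER>\\d+(?:\\.\\d+)?)"                // integer or decimal
        + "|(?<WORD>\\p{L}+(?:['-]\\p{L}+)*)");         // letters, allowing don't / well-known

    // Returns tokens as "TYPE:text"; characters that match no pattern are skipped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            for (String type : TYPES) {
                if (m.group(type) != null) {
                    tokens.add(type + ":" + m.group());
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Paid $12.50 on 2011-07-23, in 2011 for 3 items"));
    }
}
```

Because the rules live in one pattern, adding a token type is a matter of adding one more named alternative in the right position.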
P.S.
Depending on your needs, you could write a state-machine parser to recognize templated parts of raw text. You might see a simple state-machine recognizer in the picture below (you can construct a more advanced parser that recognizes much more complex templates in text).
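To make the state-machine idea concrete, here is a small sketch of a recognizer for one template: decimal numbers such as 12 or 12.5. The class name NumberRecognizer, the states, and the choice of template are all hypothetical; this is not the recognizer from the original picture.

```java
class NumberRecognizer {
    // States of the machine; INTEGER and FRACTION are the accepting states.
    enum State { START, INTEGER, POINT, FRACTION, REJECT }

    // Transition function: the next state for the current state and input character.
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:    return digit ? State.INTEGER : State.REJECT;
            case INTEGER:  return digit ? State.INTEGER
                                        : (c == '.' ? State.POINT : State.REJECT);
            case POINT:    return digit ? State.FRACTION : State.REJECT;
            case FRACTION: return digit ? State.FRACTION : State.REJECT;
            default:       return State.REJECT; // REJECT is a dead state
        }
    }

    // Feeds the whole token through the machine and checks the final state.
    static boolean accepts(String token) {
        State s = State.START;
        for (int i = 0; i < token.length(); i++) {
            s = step(s, token.charAt(i));
        }
        return s == State.INTEGER || s == State.FRACTION;
    }
}
```

With this sketch, accepts("12.5") holds while accepts("12.") does not, because the machine stops in the non-accepting POINT state. More complex templates would need more states and transitions, but the structure stays the same.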
If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise, it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class (java.util.regex.Pattern). That way, you control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or a number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
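One way the 20110723 question could be handled in two passes is sketched below. The first pass is purely lexical (is the token all digits?), and the second checks whether an 8-digit run is also a valid calendar date. The class name DigitRunClassifier, the yyyyMMdd layout, and the 1900 to 2100 plausibility window are assumptions for illustration; a real analyzer would likely also look at the surrounding context.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class DigitRunClassifier {
    // Returns "DATE", "NUMBER", or "OTHER" for a single token.
    static String classify(String token) {
        // Pass 1: lexical check, digits only.
        if (!token.matches("\\d+")) {
            return "OTHER";
        }
        // Pass 2: semantic check, does an 8-digit run form a real yyyyMMdd date?
        if (token.length() == 8) {
            try {
                // BASIC_ISO_DATE parses yyyyMMdd and rejects impossible dates.
                LocalDate d = LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE);
                // Assumed plausibility window; adjust for the data at hand.
                if (d.getYear() >= 1900 && d.getYear() <= 2100) {
                    return "DATE";
                }
            } catch (DateTimeParseException e) {
                // Not a valid calendar date, so treat it as a plain number.
            }
        }
        return "NUMBER";
    }
}
```

This remains a heuristic: 20110723 could still be an ordinary number in some documents, which is why further passes over neighbouring tokens may be needed.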
I recommend looking at LingPipe too. If you are OK with web services, then this article has a good summary of the different APIs.
I'd adapt Lucene's Analysis and Stemmer classes rather than reinvent the wheel. They cover the vast majority of cases. See also the additional and contrib classes.