Java open source text mining framework
I want to know what is the best open-source Java-based framework for text mining, to use both machine learning and dictionary methods.
I'm using Mallet, but there is not much documentation, and I do not know if it will fit all my requirements.
Comments (7)
I honestly think that the several answers presented here are very good. However, to fulfill my requirements I have chosen to use Apache UIMA with ClearTK. It supports several ML methods and I do not have any licensing problems. Plus, I can write wrappers for other ML methodologies, and I take advantage of the UIMA framework, which is very well organized and fast.
Thank you all for your interesting answers.
Best Regards,
ukrania
Although not a specialized text mining framework, Weka has a number of classifiers usually employed in text mining tasks, such as SVM, kNN, and multinomial Naive Bayes, among others.
It also has a few filters to work with textual data, like the StringToWordVector filter, which can perform TF/IDF transformation. Check out the Weka wiki website for more information.
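As a concrete illustration of the weighting involved, here is a minimal pure-Java sketch of the TF/IDF computation that a filter like StringToWordVector applies. The class and method names are my own for illustration, not Weka's API, and real implementations vary in how they normalize and smooth these values:

```java
import java.util.*;

// Minimal sketch of TF/IDF weighting over tokenized documents.
// Illustrative only; not Weka's StringToWordVector implementation.
public class TfIdfSketch {

    // Term frequency: raw count of `term` in the document's token list.
    static int tf(List<String> doc, String term) {
        int count = 0;
        for (String t : doc) {
            if (t.equals(term)) count++;
        }
        return count;
    }

    // Inverse document frequency: log(N / df), where df is the number of
    // documents containing the term (assumed > 0 for terms actually queried).
    static double idf(List<List<String>> docs, String term) {
        int df = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) df++;
        }
        return Math.log((double) docs.size() / df);
    }

    // TF/IDF weight of `term` in the document at `docIndex`.
    static double tfidf(List<List<String>> docs, int docIndex, String term) {
        return tf(docs.get(docIndex), term) * idf(docs, term);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("text", "mining", "with", "java"),
                Arrays.asList("java", "java", "libraries"));
        // "mining" occurs once in doc 0 and in 1 of 2 documents overall.
        System.out.println(tfidf(docs, 0, "mining"));
    }
}
```

A term that appears in every document gets an IDF of log(1) = 0, which is exactly why such weighting suppresses uninformative stopword-like terms.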
Maybe have a look at Java Open Source NLP and Text Mining tools.
I've used LingPipe -- a suite of Java libraries for the linguistic analysis of human language -- for text mining (and other related) tasks.
It is a very well documented software package, and the site contains several tutorials which thoroughly explain how to do a certain task with LingPipe, such as named entity recognition. There is also a newsgroup, wherein you can post any question you have about the software (or NLP-related tasks) and have a prompt reply from the authors of the package themselves; and of course, a blog.
The source code is also very easy to follow and well documented which, for me, is always a big plus.
As for machine learning algorithms, there are plenty, from Naïve Bayes to Conditional Random Fields. On the other hand, for dictionary-matching algorithms, they have an ExactDictionaryChunker, which is an implementation of the Aho-Corasick algorithm (a very, very fast algorithm for this task).
In sum, I think it is one of the best NLP software packages for Java (I haven't used every single package that is out there, so I can't say it's the best), and I definitely recommend it for the task that you have at hand.
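To make the dictionary-matching idea concrete, here is a naive pure-Java sketch of what an exact dictionary chunker computes: every span of the text that exactly matches a dictionary entry. Note that this is deliberately not Aho-Corasick -- LingPipe's ExactDictionaryChunker uses that algorithm to find all matches in a single pass over the text -- and the names here are illustrative, not LingPipe's API:

```java
import java.util.*;

// Naive exact dictionary matcher. Illustrates the *output* of a chunker
// like LingPipe's ExactDictionaryChunker; the real one uses Aho-Corasick
// for a single-pass scan, whereas this version rescans per entry.
public class DictionaryChunkerSketch {

    // Returns "start:end:entry" for every occurrence of every dictionary
    // entry in the text (overlapping matches included).
    static List<String> chunk(String text, Set<String> dictionary) {
        List<String> chunks = new ArrayList<>();
        for (String entry : dictionary) {
            int at = text.indexOf(entry);
            while (at >= 0) {
                chunks.add(at + ":" + (at + entry.length()) + ":" + entry);
                at = text.indexOf(entry, at + 1);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // TreeSet gives a deterministic iteration order over entries.
        Set<String> dict = new TreeSet<>(Arrays.asList("Java", "Mallet"));
        System.out.println(chunk("I use Mallet with Java.", dict));
    }
}
```

The practical difference matters at scale: this sketch is O(text length × dictionary size), while Aho-Corasick builds a trie with failure links so matching cost does not grow with the number of dictionary entries.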
You may already know about GATE: http://gate.ac.uk/
...but that's what we've used (at my day job) for lots of different text mining problems. It's pretty flexible and open.
I built a maximum entropy named entity recognizer for CoNLL data using OpenNLP MaxEnt http://sourceforge.net/projects/maxent/ for a course once.
It required a lot of data preprocessing with custom Perl scripts to get all the features extracted into nice, neat numerical vectors, though.
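The kind of preprocessing described above can be sketched in plain Java instead of Perl: map each token to a handful of contextual features, and give each distinct feature string a stable integer index so the result is a numeric vector. The feature names and indexing scheme below are illustrative assumptions, not OpenNLP's API:

```java
import java.util.*;

// Sketch of turning token-level features into numeric indices for a
// MaxEnt-style trainer. Feature templates here are illustrative only.
public class FeatureVectorSketch {

    // Maps each distinct feature string to a stable integer index.
    static final Map<String, Integer> index = new LinkedHashMap<>();

    static int indexOf(String feature) {
        return index.computeIfAbsent(feature, f -> index.size());
    }

    // A few simple contextual features for the token at position i:
    // the lowercased word, its capitalization, and the previous word.
    static List<Integer> features(String[] tokens, int i) {
        List<Integer> vec = new ArrayList<>();
        vec.add(indexOf("word=" + tokens[i].toLowerCase()));
        vec.add(indexOf("cap=" + Character.isUpperCase(tokens[i].charAt(0))));
        vec.add(indexOf("prev=" + (i > 0 ? tokens[i - 1].toLowerCase() : "<s>")));
        return vec;
    }

    public static void main(String[] args) {
        String[] sentence = {"John", "lives", "in", "Kyiv"};
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i] + " -> " + features(sentence, i));
        }
    }
}
```

Real NER feature sets are much richer (suffixes, shape patterns, gazetteer hits), but the indexing trick is the same: the trainer only ever sees integers.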
We use Lucene to process live streams from the internet. It has a native Java API.
http://lucene.apache.org/java/docs/
You can then use Mahout, which is a collection of machine learning algorithms that operate on top of Lucene.
http://lucene.apache.org/mahout/
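To illustrate the core structure Lucene builds, here is a toy inverted index in plain Java: a map from each term to the set of document IDs containing it. This is only a sketch of the idea -- the real Lucene API adds analyzers, scoring, and on-disk segment storage on top -- and none of these names are Lucene's:

```java
import java.util.*;

// Toy inverted index: term -> IDs of documents containing it.
// Illustrates the data structure at the heart of Lucene; not Lucene code.
public class InvertedIndexSketch {

    final Map<String, Set<Integer>> postings = new HashMap<>();
    int nextDocId = 0;

    // Tokenizes crudely on non-word characters and records each term.
    int add(String document) {
        int docId = nextDocId++;
        for (String term : document.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the IDs of all documents containing the term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("Lucene indexes live streams");
        idx.add("Mahout runs machine learning on top of Lucene");
        System.out.println(idx.search("lucene")); // both documents match
    }
}
```

Because lookups hit the postings map directly, query cost depends on the number of matching documents rather than the size of the whole collection, which is what makes this structure suitable for live-stream workloads.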