Data mining engines and frameworks?
What open-source or free data mining engines and frameworks do you know and use for textual data?
Thank you for any advice!
Comments (9)
Not really sure of what you're looking for. Perhaps something like Lucene?
Apache Mahout is an open-source machine learning library that can be used with or without MapReduce (Apache Hadoop).
It provides implementations of the following algorithms in Java:
You can read more:
http://mahout.apache.org/
http://girlincomputerscience.blogspot.com.br/2010/11/apache-mahout.html
http://www.ibm.com/developerworks/java/library/j-mahout/
RapidMiner is free and open source, runs on Windows, Mac, and Linux, and is a nice program built around graphical workflows. It runs all Weka code and integrates with R.
Weka and RapidMiner aren't that strong on clustering. They mostly do classification and similar prediction tasks, with very little clustering. Have a look at ELKI, which, like Weka, is a university project but has tons of clustering and outlier detection methods.
I don't know about engines or frameworks, but I've used a tool called Weka, which has plenty of algorithms implemented.
For text processing (rather than numeric data mining and clustering), the NLTK toolkit is worth a look. It is intended for teaching natural language processing techniques in Python, so it is ideal for tinkering with, and you are bound to find many of its component classes and implementations useful if you choose to use Python.
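To give a feel for the kind of tinkering NLTK allows, here is a minimal sketch that tokenizes a string and counts word frequencies. It assumes NLTK is installed (`pip install nltk`); the function name `top_words` and the sample sentence are made up for illustration. It uses `wordpunct_tokenize`, which is regex-based and should not need any extra corpus downloads.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize


def top_words(text, n=3):
    """Return the n most frequent alphabetic tokens in text, lowercased.

    top_words is a hypothetical helper for this example, not part of NLTK.
    """
    # Split on word/punctuation boundaries, then keep purely alphabetic tokens.
    tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]
    # FreqDist is a Counter subclass, so most_common() is available.
    return FreqDist(tokens).most_common(n)


if __name__ == "__main__":
    sample = "Data mining finds patterns. Text mining finds patterns in text."
    print(top_words(sample, 4))
```

From here you could swap in a stemmer or a stop-word filter before counting, which is usually the next step when preparing text for a mining tool.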
RapidMiner is my preferred data mining solution:
http://www.RapidMiner.com/
Here is a survey of the most popular data mining tools among data mining experts:
http://www.kdnuggets.com/2011/05/tools-used-analytics-data-mining.html
KDnuggets Poll 2011: RapidMiner is the most widely used data mining solution among data mining experts world-wide.
I'm the author of an open-source Java library for frequent pattern mining. It offers algorithms for mining sequential patterns, association rules, frequent itemsets, etc.
Although it is not specifically designed for text mining, some of the algorithms could be applied to mine frequent patterns in text. For example, if you want to find sequences of words that often appear together in several sentences, you could apply a sequential pattern mining algorithm. But to do that, you would need to do some pre-processing before applying my software so that your text files are in the proper format.
You can check the software here:
http://www.philippe-fournier-viger.com/spmf/
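The pre-processing step mentioned above could look roughly like the following sketch. It assumes the integer sequence format that SPMF's sequential pattern mining algorithms commonly read, where each item is a positive integer, `-1` closes an itemset, and `-2` closes a sequence; please check the SPMF documentation for the algorithm you pick, since this is my understanding rather than something stated in the answer. The function name `sentences_to_spmf` and the naive sentence splitting are illustrative only.

```python
import re


def sentences_to_spmf(text):
    """Turn raw text into lines of an SPMF-style sequence database.

    Each sentence becomes one sequence and each word a single-item itemset.
    Assumed format: "<id> -1 <id> -1 ... -2" (verify against the SPMF docs).
    Returns the lines plus the word-to-id mapping needed to decode results.
    """
    word_ids = {}
    lines = []
    # Naive sentence split on ., ! and ?; a real tokenizer would do better.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        # Assign ids 1, 2, 3, ... in order of first appearance.
        ids = [word_ids.setdefault(w, len(word_ids) + 1) for w in words]
        lines.append(" -1 ".join(map(str, ids)) + " -1 -2")
    return lines, word_ids


if __name__ == "__main__":
    lines, mapping = sentences_to_spmf("The cat sat. The cat ran!")
    print("\n".join(lines))
    print(mapping)
```

Writing the returned lines to a text file gives an input that a sequential pattern miner could read, and the mapping lets you translate the mined integer patterns back into words.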
Apache Mahout offers a bunch of popular algorithms that can also be applied to textual data, and it is quite scalable. Apache UIMA doesn't offer data mining algorithms, but it is a framework that is widely used in natural language processing.