Text mining library or linguistic library?
I have a bunch of data harvested from a forum I own, and would like to do some text mining or use a linguistic library to extract useful information.
Any text mining or data mining library, in any language, will do.
Thank you.
Comments (6)
I recommend that you have a look at R. It has a large number of text mining packages: see the Natural Language Processing task view. In particular, look at the tm package. Here are some relevant links: mailing list (https://stat.ethz.ch/pipermail/r-devel/), newsgroup postings from 2006.
Another example of a useful package for this is Gary King's readme package.
You may like to have a look at the Python NLTK (Natural Language Toolkit): it is specifically designed for this kind of thing.
There is also a great book you can buy to get you started.
Mallet is a Java library designed for text mining. Once you have preprocessed the text data, a general data mining tool like Weka would also suffice for your task.
If you have access to SPSS or SAS, their products should be easier to use.
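As a rough illustration of the preprocessing step mentioned above, here is a minimal Python sketch that turns a list of posts into a document-term count matrix and writes it as CSV, a format that general data mining tools can usually import. The function names (`tokenize`, `document_term_matrix`, `to_csv`) and the `min_docs` parameter are hypothetical and are not part of Mallet or Weka.

```python
import csv
import io
import re
from collections import Counter

# Simple token pattern: runs of lowercase letters and digits (an assumption;
# adjust it for your forum's language and markup).
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def document_term_matrix(docs, min_docs=1):
    """Return (vocab, rows), where rows[i][j] counts vocab[j] in docs[i].

    Terms that appear in fewer than `min_docs` documents are dropped.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each term counted once per document
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows


def to_csv(vocab, rows):
    """Serialize the matrix as CSV text with the vocabulary as header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(vocab)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    posts = ["spam spam eggs", "eggs and ham"]
    vocab, rows = document_term_matrix(posts)
    print(to_csv(vocab, rows))
```

Raising `min_docs` is one simple way to keep the matrix small by discarding terms that occur in only a few posts.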
Try GATE. It has a GUI, and of course you can use its Java API for more power:
http://gate.ac.uk/family/developer.html
You can also use Weka for processing text and doing text mining. Have a look at these useful lectures:
http://sentimentmining.net/weka/
Stanford CoreNLP is good for English text, and has features such as Named Entity Recognition. Take a look at: http://nlp.stanford.edu/software/corenlp.shtml
GATE, which Ehsan already recommended, is also good, but it can be a bit complicated if you need to write your own components. For large-scale work it is great, though.
UIMA is similar to GATE, but not as easy to use because it does not have an extensive GUI like GATE's. (http://uima.apache.org)
I would recommend the following Python libraries:
nltk
keras
tensorflow
Note: Before any text analysis, you should clean the data according to your requirements.
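What "cleaning" means depends on the data, but for forum posts it often involves stripping markup and URLs, lowercasing, and dropping very common words. The following is a minimal sketch using only the Python standard library. The stopword list, the regular expressions, and the function names `clean_post` and `top_terms` are illustrative assumptions, not part of nltk, keras, or tensorflow.

```python
import html
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real projects use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "in", "i", "for", "this"}

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <p> or </b>
URL_RE = re.compile(r"https?://\S+")    # bare http/https links
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")  # words, optionally with an apostrophe part


def clean_post(raw):
    """Return a list of lowercase, stopword-free tokens from one raw post."""
    text = TAG_RE.sub(" ", raw)      # drop HTML tags
    text = html.unescape(text)       # turn entities like &amp; back into characters
    text = URL_RE.sub(" ", text)     # drop URLs
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def top_terms(posts, n=5):
    """Count cleaned tokens across all posts and return the n most common."""
    counts = Counter()
    for post in posts:
        counts.update(clean_post(post))
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["<p>The <b>parser</b> is slow &amp; buggy, see http://example.com/x</p>"]
    print(clean_post(sample[0]))
    print(top_terms(sample, n=3))
```

The cleaned token lists can then be passed to whichever library you choose for the actual analysis.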