Python data mining
I am not much into data mining, but I need some ideas on clustering. Let me first describe my problem.
I have around 100 data sheets containing user reviews. I am trying to find, for instance, words that describe quality. One person may say "amazing quality", another may say "great quality". Now I have to cluster the documents that contain such similar sentences and get the frequency of those sentences. What concept should I apply here?
I guess I have to specify some stop words and synonyms, but I am not too familiar with these concepts.
Can someone give me some detailed links or an explanation, and suggest which tool to use? I am basically a Python programmer, so any Python module would be appreciated.
Thank You
3 Answers
There is http://www.nltk.org/ for language processing. With this library you are able to split text into sentences, calculate term frequencies, find synonyms, and more.
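For example, a minimal sketch with NLTK (the review text and the word "amazing" are just placeholders; it assumes the punkt, stopwords and wordnet corpora have been fetched with nltk.download()):

    import nltk
    from nltk.corpus import stopwords, wordnet

    review = "Amazing quality. The delivery was slow but the product feels great."

    # split into sentences, then tokens, dropping punctuation and stop words
    stop = set(stopwords.words('english'))
    sentences = nltk.sent_tokenize(review)
    words = [w.lower() for s in sentences for w in nltk.word_tokenize(s)
             if w.isalpha() and w.lower() not in stop]

    # term frequencies
    freq = nltk.FreqDist(words)
    print(freq.most_common(5))

    # WordNet synonyms, e.g. to treat "amazing quality" and "great quality" alike
    synonyms = {lemma.name() for syn in wordnet.synsets('amazing')
                for lemma in syn.lemmas()}
    print(synonyms)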
Carrot^2 is a nice open-source project for clustering text snippets; unfortunately it is written in Java. The idea behind its clustering is term and phrase (bigram and trigram) frequencies. After preprocessing, each document (snippet, review) is represented as a vector of term/phrase frequencies. To calculate clusters they use some linear algebra and find the principal components in that term space. These components are then used to form the clusters and their labels.
In your case it is worth treating the reviews as documents, clustering them, and getting labels for the clusters. Perhaps the labels would somehow characterize the reviews.
In your specific case it is also worth restricting the term space to the words of interest, dramatically decreasing the dimensionality, which is very critical in such tasks.
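If you want to stay in Python, a rough sketch of the idea described above (term-frequency vectors, principal components via SVD) could look like this; the tiny corpus is made up and this is not the actual Carrot^2 algorithm:

    import numpy as np
    from collections import Counter

    reviews = ["amazing quality", "great quality", "quality is amazing",
               "terrible battery", "battery died fast"]

    # vocabulary and term-frequency matrix (one row per review)
    vocab = sorted({w for r in reviews for w in r.split()})
    tf = np.array([[Counter(r.split())[w] for w in vocab] for r in reviews],
                  dtype=float)

    # principal components of the term space via SVD
    tf -= tf.mean(axis=0)
    U, s, Vt = np.linalg.svd(tf, full_matrices=False)
    coords = U[:, :2] * s[:2]   # each review projected onto the top 2 components

    # reviews that end up close together in `coords` talk about similar things;
    # any clustering algorithm (e.g. k-means) can now be run on these vectors
    print(coords)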
Another useful project is montylingua.
I would follow the primary suggestion out of this question on CrossValidated. In particular, have a look at scikit-learn.
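For instance, a small sketch with scikit-learn, clustering the reviews on TF-IDF vectors with k-means and printing the top terms of each cluster as a rough label (the corpus and the number of clusters are made-up placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    reviews = ["amazing quality", "great quality, would buy again",
               "quality is amazing", "battery died after a week",
               "terrible battery life"]

    # bag-of-words TF-IDF vectors, English stop words removed
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(reviews)

    # cluster the reviews and inspect the strongest terms of each cluster
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    for c in range(km.n_clusters):
        top = km.cluster_centers_[c].argsort()[::-1][:3]
        print("cluster", c, "->", [terms[i] for i in top])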
Here are two papers that extract information from evaluative text. It seems like they're doing what you're looking to do.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.9534
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.5392&rep=rep1&type=pdf