Clustering documents with the K-means algorithm
How do I calculate the distance between two documents? In k-means for numbers you have to calculate the distance between two points. I know that I can use the cosine function.
I want to cluster RSS documents. I have done stemming and removed the stop words from the documents. I have counted the frequency of each word in each document. Now I want to implement the k-means algorithm.
Comments (3)
I'm assuming that your difficulty is in creating the feature vector? Create a feature vector for each document by
For example, if you have
Then the total set of words is [the,quick,brown,fox,jumped,over,the,dog,cows,eat,hippo,meat] and the document vectors are
And now you just have two giant feature vectors that you can use to represent the documents, and you can run k-means clustering on them. As others have said, Euclidean distance can be used to calculate the distance between documents.
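To make the feature-vector idea concrete, here is a minimal Python sketch, assuming each document has already been stemmed, stripped of stop words, and split into a list of tokens. The three sample documents and the helper names `build_vocabulary` and `to_count_vector` are made up for illustration; they are not taken from the answer above.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return a sorted list of every distinct word across all documents."""
    return sorted({word for doc in tokenized_docs for word in doc})


def to_count_vector(tokens, vocabulary):
    """Map one document to a list of word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in this document.
    return [counts[word] for word in vocabulary]


# Hypothetical documents, assumed to be stemmed and stop-word-filtered already.
docs = [
    "quick brown fox jump over dog".split(),
    "cow eat hippo meat".split(),
    "brown dog eat meat".split(),
]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(" ".join(doc), "->", vector)
```

Every vector has the same length as the vocabulary, so any two documents can be compared position by position. With real RSS feeds the vocabulary gets large and most entries are zero, so a sparse representation is probably worth considering.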
There are various distance functions. One is the Euclidean distance.
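As a rough illustration of two such distance functions, the sketch below implements Euclidean distance and the cosine distance mentioned in the question (one minus the cosine of the angle between the vectors). The function names and the sample vectors are assumptions for this example, and returning 1.0 for an all-zero vector is just one possible convention.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty document is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical word-count vectors: the second is the first scaled by ten,
# as if the same text had been repeated ten times.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]

print("euclidean:", euclidean_distance(short_doc, long_doc))
print("cosine:", cosine_distance(short_doc, long_doc))
```

The two measures behave differently on raw counts: Euclidean distance grows with document length, while cosine distance only looks at the direction of the vectors. For documents of very different lengths, that difference may matter when choosing between them.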
You can use the Euclidean distance formula for an n-dimensional system.
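For n-dimensional vectors x and y, that formula is the square root of the sum over all i of (x_i - y_i)^2. Below is a minimal sketch of how it could plug into a plain k-means loop. The function names, the random choice of initial centroids, and the tiny sample vectors are assumptions for illustration, not a tuned implementation.

```python
import math
import random


def euclidean_distance(a, b):
    """n-dimensional Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, iterations=100, seed=0):
    """Return (assignments, centroids) after running k-means on the vectors."""
    rng = random.Random(seed)
    # Assumed initialisation: pick k of the input vectors at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids


# Hypothetical word-count vectors for four documents over a 4-word vocabulary.
sample_vectors = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels, centers = k_means(sample_vectors, k=2)
print(labels)
```

The result depends on the initial centroids, so in practice it is common to run the loop several times with different seeds and keep the clustering with the lowest total distance.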