Clustering documents with the K-means algorithm

Posted 2024-11-11 00:16:59

How do I calculate the distance between two documents? In k-means for numeric data you have to calculate the distance between two points. I know that I can use the cosine function.
I want to cluster RSS documents. I have stemmed the documents and removed the stop words, and I have counted the frequency of each word in each document. Now I want to implement the k-means algorithm.


Comments (3)

日裸衫吸 2024-11-18 00:16:59

I'm assuming that your difficulty is in creating the feature vector? Create a feature vector for each document by

  1. Collecting all distinct words from all documents to form one giant vocabulary, with one vector position per word
  2. Setting each element of the document's vector to the count of the corresponding term in that document.

For example, if you have

Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat

Then the total set of words is [the,quick,brown,fox,jumped,over,dog,cows,eat,hippo,meat] and the document vectors are

Document 1 = [2,1,2,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,1,1,1,1]

And now you have one giant feature vector per document, and you can run k-means clustering on those vectors. As others have said, Euclidean distance can be used to calculate the distance between documents.
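
The two steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the only way to do it: the helper names `build_vocabulary` and `to_count_vector` are made up for this example, and splitting on whitespace is an assumption (the documents are taken to be already stemmed and stripped of stop words). Each distinct word gets exactly one position, so a repeated word such as `the` appears only once in the vocabulary.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the distinct words across all documents, in first-seen order."""
    vocabulary = []
    seen = set()
    for text in documents:
        for word in text.split():  # assumption: words are whitespace-separated
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def to_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


documents = [
    "the quick brown fox jumped over the brown dog",
    "the brown cows eat hippo meat",
]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(text, vocabulary) for text in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, which is what lets a distance function compare any two documents position by position.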

暮色兮凉城 2024-11-18 00:16:59

There are various distance functions. One is the Euclidean distance.
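
As an illustration of two other distance functions, here is a small sketch of the cosine distance mentioned in the question and of the Manhattan distance. The function names and the sample vectors are invented for this example, and returning 1.0 for an all-zero vector is an assumption, since the cosine is undefined in that case.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); ignores the overall length of the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Sum of absolute differences between corresponding components."""
    return sum(abs(x - y) for x, y in zip(a, b))


short_doc = [1, 2, 0]
long_doc = [2, 4, 0]   # same word proportions as short_doc, twice as long
other_doc = [0, 0, 3]  # shares no words with short_doc

print(cosine_distance(short_doc, long_doc))     # close to 0
print(cosine_distance(short_doc, other_doc))    # 1.0
print(manhattan_distance(short_doc, long_doc))  # 3
```

Cosine distance treats a short document and a long one with the same word proportions as nearly identical, which is often what is wanted for text. Note that plain k-means updates centroids with the mean, which pairs naturally with Euclidean distance; one common workaround is to scale each vector to unit length first, since for unit vectors the squared Euclidean distance equals twice the cosine distance.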

土豪 2024-11-18 00:16:59

You can use the Euclidean distance formula for an n-dimensional system.

sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 + ... )
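
To show where this formula fits, here is a minimal k-means sketch built on it. It is an illustration under stated assumptions rather than a reference implementation: the names `euclidean_distance`, `mean_vector` and `k_means` are invented here, the centroids are seeded with the first `k` vectors (real implementations usually pick random or k-means++ seeds), and the input vectors are toy data.

```python
import math


def euclidean_distance(a, b):
    """sqrt((a1-b1)^2 + (a2-b2)^2 + ...) for two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100):
    # Assumption: seed the centroids with the first k vectors (deterministic).
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no vector changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids


# Toy term-count vectors: documents 0 and 2 share words, as do 1 and 3.
vectors = [[2, 0, 0], [0, 3, 1], [3, 0, 1], [0, 2, 0]]
assignments, centroids = k_means(vectors, k=2)
print(assignments)  # [0, 1, 0, 1]
print(centroids)    # [[2.5, 0.0, 0.5], [0.0, 2.5, 0.5]]
```

The loop alternates between assigning vectors to the nearest centroid and recomputing centroids, and stops once the assignments no longer change.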