Clustering documents using the K-means algorithm

Posted on 2024-11-11 00:16:59, 125 words, 7 views, 0 comments

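How do I calculate the distance between two documents? In k-means on numbers, you have to calculate the distance between two points. I know I can use the cosine function. I want to cluster RSS documents. I have already done stemming and removed stop words from the documents, and I have counted the frequency of each word in every document. Now I want to implement the k-means algorithm.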

日裸衫吸 2024-11-18 00:16:59

I'm assuming that your difficulty is in creating the feature vector. Create a feature vector for each document by:

1. Collecting all the distinct words across the documents into one vocabulary, which fixes the dimensions of the vector.
2. Setting each element of the vector to the count of the corresponding term in that document.

For example, if you have

Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat

Then the total set of distinct words is [the,quick,brown,fox,jumped,over,dog,cows,eat,hippo,meat] and the document vectors are

Document 1 = [2,1,2,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,1,1,1,1]
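The construction above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names `build_vocabulary` and `count_vector` are made up for this example, words are split on whitespace, and the vocabulary keeps each distinct word once in first-seen order, so `the` occupies a single position and is counted twice for Document 1.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the distinct words across all documents, in first-seen order."""
    vocabulary = []
    seen = set()
    for text in documents:
        for word in text.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def count_vector(text, vocabulary):
    """Return the term counts of one document, aligned with the vocabulary."""
    counts = Counter(text.split())
    # Counter returns 0 for words that do not occur in this document.
    return [counts[word] for word in vocabulary]


documents = [
    "the quick brown fox jumped over the brown dog",
    "the brown cows eat hippo meat",
]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(text, vocabulary) for text in documents]
```

In the asker's case the input would be the stemmed, stop-word-free text rather than raw sentences, so the vocabulary would be smaller than in this toy example.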

Now you have one feature vector per document, and you can run k-means clustering on those vectors. As others have said, Euclidean distance can be used to calculate the distance between documents.
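For completeness, here is a minimal k-means sketch over such vectors. It is not a tuned implementation: the function name `k_means`, the `max_iterations` parameter, and the choice to seed the centroids with the first k vectors are all assumptions made for this example (real implementations usually seed randomly or with k-means++). It relies on `math.dist`, which requires Python 3.8 or later.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100):
    """Cluster vectors into k groups using Euclidean distance.

    Returns (assignments, centroids), where assignments[i] is the index
    of the cluster that vectors[i] belongs to.
    """
    # Assumption: seed the centroids with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return assignments, centroids


# Toy data: two points near the x-axis and two near the y-axis.
labels, centers = k_means([[2, 0], [0, 2], [3, 0], [0, 3]], k=2)
```

With only two documents there is nothing meaningful to cluster, so the toy call uses four small vectors instead. Replacing `math.dist` with another distance function is the only change needed to try a different metric, although the mean-based update step is tied to Euclidean distance.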

暮色兮凉城 2024-11-18 00:16:59

There are various distance functions. One is Euclidean distance.
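As a rough illustration of two other options, here is a sketch of cosine distance (the measure the question mentions) and Manhattan distance on plain Python lists. The function names are made up for this example, and returning 1.0 when one vector is all zeros is an assumption, since cosine similarity is undefined in that case.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero (empty) document as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def manhattan_distance(a, b):
    """Sum of the absolute differences of the coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine distance ignores vector length, so a long document and a short one with the same word proportions come out close together, which is often what you want for text.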

土豪 2024-11-18 00:16:59

You can use the Euclidean distance formula for an n-dimensional system.

sqrt((x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2 + ...)
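Written out for vectors of any length, the formula might look like the sketch below. The function name `euclidean_distance` is just for illustration, and the length check is an added assumption about how mismatched inputs should be handled.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Differences are -3, -4 and 0, so the distance is sqrt(9 + 16 + 0).
distance = euclidean_distance([1, 2, 3], [4, 6, 3])
```

On Python 3.8 or later, the standard library's `math.dist(p, q)` computes the same value.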

About the author

鸵鸟症
