Kmeans runs extremely slowly when clustering more than 3 documents
I'm trying to use kmeans to cluster similar documents to each other.
I am using NLTK's KMeans.
When I cluster only 3 documents, it takes less than 5 seconds. But once I add a fourth document, it doesn't finish (I killed it after 10 minutes).
When there are 4 documents, the vector size is about 1000. The vectors are sparse too, but I have 8 gigs of RAM, so I'm not worried about that. 1000 shouldn't be that much.
Does anyone have any idea why it handles 3 documents in 5 seconds but can't finish 4 documents, at least not within the 10 minutes I waited before giving up? When I go into production, it will theoretically have to cluster 300 or 400 documents at a time.
I was thinking of trying a different kmeans library to see if the NLTK implementation is weak, but I don't want to waste my effort if I'm the problem.
Thanks all.
Comments (1)
I switched to the Pycluster library and it works now.
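For anyone who wants to check whether the library or the input is at fault before switching, a tiny dependency-free k-means can serve as a baseline. The sketch below is not NLTK's or Pycluster's implementation; the function names, the squared Euclidean distance, the `max_iter` cap, and the toy 4-dimensional vectors are all assumptions made for illustration. The iteration cap is there so the loop cannot run forever if the assignments never settle.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100, seed=0):
    """Plain k-means; returns (cluster index per vector, list of means)."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    means = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iter):  # hard cap so the loop always terminates
        # Assign each vector to its nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, means[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Recompute each mean; an empty cluster keeps its previous mean.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, means


# Hypothetical toy "document vectors": two obvious groups of two.
docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
labels, centers = kmeans(docs, k=2)
print(labels)
```

If a baseline like this finishes instantly on the same vectors, the input size is unlikely to be the problem, and the clusterer's settings or convergence behavior are the next things to inspect.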