Grouping similar documents
This question relates to grouping/clustering similar documents in Information Retrieval.
I have a set of documents, D1, D2, ..., Dn. For each document Di, I also have a set of keywords, Di_k1, Di_k2, ..., Di_km. The similarity between two documents Di and Dj is given by a function of their keyword sets, i.e. similarity(Di, Dj) = f(Di_K, Dj_K).
Now, I want to place each of these documents into groups/clusters such that each cluster contains similar documents, where "similar" means that the similarity between the elements of a cluster meets a given threshold.
One easy way is to look at every possible pair of documents, which I obviously want to avoid because the number of documents I have is fairly large, in the millions. I was going through the Introduction to Information Retrieval book, but I did not find any scalable algorithm mentioned there.
My question is: what kind of algorithm can help me cluster the documents efficiently? I am especially interested in the computational complexity of the algorithm.
Thanks in advance for any pointers.
Comments (2)
Okay, off the top of my head, you can use a language-model-based approach. First, use machine learning to build a language model (LM) for each possible class, say a bigram LM. Then, for each new document you see, calculate P(new document | class) for all classes and choose the class with the maximum probability. Use Bayes' rule to simplify the above formula.
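To make that concrete, here is a minimal sketch in Python. It assumes you already have a few tokenized example documents per class (the question does not say you do), and it uses add-one smoothing, which is my choice rather than anything stated above. The names `train_bigram_lm`, `log_prob` and `classify` are made up for illustration.

```python
import math
from collections import Counter

def train_bigram_lm(docs):
    """Count bigrams for one class. docs is a list of token lists."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in docs:
        padded = ["<s>"] + tokens  # "<s>" marks the start of a document
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def log_prob(tokens, model, vocab_size):
    """log P(document | class) under a bigram LM with add-one smoothing."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + tokens
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total

def classify(tokens, models, priors):
    """Pick the class maximizing log P(class) + log P(document | class)."""
    # Shared vocabulary size across classes, plus one slot for unseen tokens.
    vocab_size = len(set().union(*(m[2] for m in models.values()))) + 1
    return max(
        models,
        key=lambda c: math.log(priors[c]) + log_prob(tokens, models[c], vocab_size),
    )
```

Scoring one document costs time proportional to (number of classes) * (document length), so a pass over n documents is linear in n once the models are built. The catch is that the classes have to be known up front.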
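One option is to relax the requirement that ALL documents in a cluster be similar to each other. Pick an arbitrary document as the center of each cluster and only require similarity to that center.
The number of similarity comparisons is then roughly
(n / avgClusterSize) * (n / 2)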
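A rough sketch of this in Python, in a single pass over the documents. The question leaves the similarity function f unspecified, so Jaccard similarity on the keyword sets is only a stand-in here, and `leader_cluster` is a name I made up.

```python
def jaccard(a, b):
    """Stand-in for f: size of the overlap divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(docs, threshold, similarity=jaccard):
    """docs maps a document id to its keyword set. Returns lists of ids."""
    centers = []   # keyword set of each cluster's center
    clusters = []  # document ids in each cluster, parallel to centers
    for doc_id, keywords in docs.items():
        for idx, center in enumerate(centers):
            if similarity(keywords, center) >= threshold:
                clusters[idx].append(doc_id)  # join the first close-enough center
                break
        else:
            # No existing center is similar enough: this document starts a cluster.
            centers.append(keywords)
            clusters.append([doc_id])
    return clusters

docs = {
    "D1": {"python", "clustering", "ir"},
    "D2": {"python", "clustering", "search"},
    "D3": {"cooking", "soup"},
    "D4": {"cooking", "soup", "bread"},
}
print(leader_cluster(docs, threshold=0.5))
```

Each document is compared only against the centers found so far, never against every other document. The result depends on the order in which documents arrive, and two members of the same cluster are only guaranteed to be similar to the center, not to each other.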