TF-IDF: am I understanding it right?
I am interested in doing some document clustering, and I am currently considering TF-IDF for this.
If I'm not mistaken, TF-IDF is mainly used to evaluate the relevance of a document to a given query. If I don't have a particular query, how can I apply TF-IDF to clustering?
Comments (3)
For document clustering, a common approach is the k-means algorithm. If you know how many types of documents you have, you know what k is.
To make it work on documents:
a) Choose k initial documents at random as the starting cluster centers.
b) Assign each document to the cluster whose center is at the minimum distance from it.
c) Once all documents are assigned, compute the centroid of each cluster; these k centroids act as k new "documents" representing the clusters. Repeat b) and c) until the assignments stop changing.
Now, the questions are:
a) How to calculate the distance between two documents: use the cosine similarity between the document's term vector and the cluster center's term vector (a higher similarity means a smaller distance). The term weights here are simply the TF-IDF values calculated earlier for each document.
b) What the centroid should be: for a given term, the sum of its TF-IDF weights across the documents in the cluster, divided by the number of documents in the cluster. Do this for every term that occurs in the cluster. This gives you another n-dimensional "document".
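The steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a tuned implementation: the function names (`tfidf_vectors`, `cosine_similarity`, `centroid`, `kmeans`), the smoothed IDF formula, the fixed iteration cap, and the assumption that documents are already tokenized and non-empty are all choices made here, not part of the answer.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} TF-IDF dicts."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1
        # Smoothed IDF is one common variant (an assumption, not from the post).
        vectors.append({t: (c / length) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Per-term sum of weights divided by the number of documents."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster sparse vectors; returns one cluster index per document."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # step a: k random documents
    labels = None
    for _ in range(iterations):
        # Step b: the most similar centroid is the closest one.
        new_labels = [max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Step c: recompute each non-empty cluster's centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels

if __name__ == "__main__":
    docs = [["cat", "dog", "cat"], ["dog", "cat", "pet"],
            ["stock", "market", "trade"], ["market", "stock", "price"]]
    print(kmeans(tfidf_vectors(docs), k=2))
```

The result depends on the random initial documents, so it is usually worth running with several seeds and comparing the clusterings.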
Hope that helps!
Not exactly: TF-IDF gives you the relevance of a term in a given document.
So you can perfectly well use it for your clustering by computing a proximity, which would be something like
for each term t in both doc i and doc j.
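The formula itself did not survive in this copy of the answer, so the following is only one plausible reading, not the answerer's expression: it sums the product of the two documents' TF-IDF weights over the terms they share. The function name `shared_term_proximity` and the sparse `{term: weight}` representation are assumptions made for illustration.

```python
def shared_term_proximity(doc_i, doc_j):
    """Sum of products of TF-IDF weights over terms present in both documents.

    doc_i and doc_j are {term: tfidf_weight} dicts (an assumed representation).
    """
    shared = doc_i.keys() & doc_j.keys()  # terms t in both doc i and doc j
    return sum(doc_i[t] * doc_j[t] for t in shared)

if __name__ == "__main__":
    a = {"cluster": 0.5, "tfidf": 0.2}
    b = {"cluster": 0.4, "query": 0.9}
    print(shared_term_proximity(a, b))
```

Dividing this sum by the product of the two vectors' lengths would give the cosine similarity, which avoids favoring long documents.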
TF-IDF serves a different purpose; unless you intend to reinvent the wheel, you are better off using a tool like Carrot. Googling for document clustering will turn up many algorithms if you wish to implement one on your own.