How to calculate precision and recall in clustering?
I am really confused about how to compute precision and recall in clustering applications.
I have the following situation:
I am given two sets, A and B. By using a unique key for each element, I can determine which elements of A and B match. I want to cluster those elements based on features (not using the unique key, of course).
I am doing the clustering, but I am not sure how to compute precision and recall. The formulas, according to the paper "Extended Performance Graphs for Cluster Retrieval" (http://staff.science.uva.nl/~nicu/publications/CVPR01_nies.pdf), are:
p = precision = relevant retrieved items/retrieved items and
r = recall = relevant retrieved items/relevant items
I really do not understand which elements fall under which category.
What I have done so far is check, within each cluster, how many matching pairs I have (using the unique key). Is that already one of precision or recall? If so, which one is it, and how can I compute the other one?
Update: I just found another paper with the title "An F-Measure for Evaluation of Unsupervised Clustering with Non-Determined Number of Clusters" at http://mtg.upf.edu/files/publications/unsuperf.pdf.
I think you'll find Wikipedia has a helpful article on precision and recall. In short:
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
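One possible way to map these formulas onto the situation in the question is pair counting. This is a minimal sketch under assumptions that are not from the original post: a pair (a, b) with a in A and b in B counts as "retrieved" when both land in the same cluster, and as "relevant" when both share the same unique key. The function name, the dictionary layout, and the toy data are all hypothetical.

```python
from itertools import product

def cross_pair_precision_recall(set_a, set_b, cluster_of):
    """Pair-counting precision and recall (illustrative assumption).

    set_a, set_b: dicts mapping item id -> unique key.
    cluster_of:   dict mapping item id -> cluster id.
    """
    tp = fp = fn = 0
    for (a, key_a), (b, key_b) in product(set_a.items(), set_b.items()):
        same_key = key_a == key_b                      # relevant pair
        same_cluster = cluster_of[a] == cluster_of[b]  # retrieved pair
        if same_cluster and same_key:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_key:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy data: three matching pairs, one of them split by the clustering.
A = {"a1": 1, "a2": 2, "a3": 3}
B = {"b1": 1, "b2": 2, "b3": 3}
clusters = {"a1": 0, "b1": 0, "a2": 0, "b2": 1, "a3": 2, "b3": 2}
precision, recall = cross_pair_precision_recall(A, B, clusters)
```

Under this reading, the count of matching pairs found inside clusters is the true-positive count. Dividing it by all same-cluster pairs gives precision, and dividing it by all matching pairs gives recall.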
There are several other measures of cluster validity that I've been using in some research on assessing clustering methods. In cases where you have a dataset labeled with classes (supervised clustering), you can use precision and recall as mentioned above, or purity and entropy.
Purity of a cluster = the number of occurrences of the most frequent class / the size of the cluster (this should be high)
Entropy of a cluster = a measure of how dispersed classes are within a cluster (this should be low)
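As a rough illustration of these two measures, here is a minimal sketch that computes them for a single cluster from the class labels of its members. It assumes plain (unnormalized) Shannon entropy in bits; some definitions divide by the log of the number of classes, which this sketch does not do. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def purity(class_labels):
    """Share of the cluster taken up by its most frequent class."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)

def entropy(class_labels):
    """Shannon entropy (bits) of the class distribution in the cluster."""
    n = len(class_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(class_labels).values())

# Hypothetical cluster: three items of class "x" and one of class "y".
cluster = ["x", "x", "x", "y"]
cluster_purity = purity(cluster)
cluster_entropy = entropy(cluster)
```

For this toy cluster the purity is 0.75 and the entropy is roughly 0.81 bits. A cluster containing a single class would have purity 1 and entropy 0.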
In cases where you don't have the class labels (unsupervised clustering), intra-cluster and inter-cluster similarity are good measures.
Intra-cluster similarity for a single cluster = average cosine similarity of all pairs within a cluster (this should be high)
Inter-cluster similarity for a single cluster = average cosine similarity of all items in one cluster compared to all items in every other cluster (this should be low)
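These two definitions could be sketched as follows. This is only an illustration: it assumes items are non-zero numeric vectors and that the cluster being scored has at least two members. The function names and the toy vectors are hypothetical.

```python
from itertools import combinations, product
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def intra_similarity(cluster):
    """Average cosine similarity over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def inter_similarity(cluster, other_clusters):
    """Average cosine similarity between one cluster and all items elsewhere."""
    others = [v for c in other_clusters for v in c]
    pairs = list(product(cluster, others))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical toy clusters pointing in orthogonal directions.
c1 = [(1.0, 0.0), (2.0, 0.0)]
c2 = [(0.0, 1.0), (0.0, 3.0)]
intra = intra_similarity(c1)
inter = inter_similarity(c1, [c2])
```

Here the first cluster is perfectly coherent (intra-cluster similarity 1) and completely separated from the second (inter-cluster similarity 0).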
This paper has some good descriptions of all four of these measures.
http://glaros.dtc.umn.edu/gkhome/fetch/papers/edcICAIL05.pdf
Nice link with the unsupervised F-measure, I'm looking into that right now.
What I make of this problem is:
One of the sets A and B is the "positive" one. Let's suppose A is positive.
Given that for an element of A in a cluster
Then just use
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
as mentioned by someone
See "Introduction to Information Retrieval", chapter 16 (flat clustering), for ways to evaluate clustering algorithms.
http://nlp.stanford.edu/IR-book/html/htmledition/flat-clustering-1.html
This section of the book may also prove useful as it discusses metrics such as precision and recall:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-unranked-retrieval-sets-1.html
The problem with precision and recall is that they generally require you to have some idea of what the 'true' labels are, whereas in many cases (and in your description) you don't know the labels, but you know the partition to compare against. I'd suggest the adjusted Rand index perhaps:
http://en.wikipedia.org/wiki/Rand_index
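To show what comparing two partitions looks like in practice, here is a minimal pure-Python sketch of the Rand index and the adjusted Rand index, written from the standard pair-counting definitions. The function name and the toy labels are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb

def rand_indices(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two partitions of the same items."""
    total_pairs = comb(len(labels_a), 2)
    # Pairs grouped together in both partitions, in A only, and in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements: together in both, or apart in both.
    agreements = both + (total_pairs - same_a - same_b + both)
    rand = agreements / total_pairs
    expected = same_a * same_b / total_pairs
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both partitions
        return rand, 1.0
    return rand, (both - expected) / (max_index - expected)

# Hypothetical partitions of four items that disagree on every grouped pair.
reference = [0, 0, 1, 1]
clustering = [0, 1, 0, 1]
rand, adjusted = rand_indices(reference, clustering)
```

The adjusted version corrects for chance agreement, so it can drop below zero when two partitions agree less than random labelings would be expected to, as in this toy case.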
I think there's a problem with your definitions.
Precision and recall are suited to classification problems, which are basically two-cluster problems. Had you clustered into something like "good items" (= retrieved items) and "bad items" (= non-retrieved items), then your definition would make sense.
In your case, you calculated the percentage of correct clustering out of all the items, which is sort of like precision, but not really, because, as I said, the definitions don't apply.
If you consider one of the sets, say A, as the gold clustering and the other set (B) as the output of your clustering process, (exact) precision and recall values can be estimated as:
From these, the standard F-measure can be estimated as well.
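One common way to make this concrete is pair counting over all item pairs. This is an assumption here and not necessarily the formulation this answer had in mind. A pair counts as a true positive when the gold clustering and the output both put its two items together. The function name and the toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_prf(gold, predicted):
    """Pair-counting precision, recall and F1 for two label lists over the same items."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical labels: the output moves one item into the wrong cluster.
gold = [0, 0, 0, 1, 1]
output = [0, 0, 1, 1, 1]
p, r, f1 = pairwise_prf(gold, output)
```

The F-measure used here is the harmonic mean of precision and recall. A weighted variant would need an extra parameter that this sketch leaves out.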