What is the difference between a confusion matrix and a contingency table?
I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m×n matrix A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j.

But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
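For concreteness, here is a minimal sketch of how such a matrix can be built, assuming the class and cluster assignments are available as two parallel label sequences (the function name and representation are my own, not from the book):

```python
import numpy as np

def class_cluster_matrix(class_labels, cluster_labels):
    """Return (A, classes, clusters) where A[i, j] counts the data
    points that are members of class c_i and elements of cluster k_j."""
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_labels))
    row = {c: i for i, c in enumerate(classes)}
    col = {k: j for j, k in enumerate(clusters)}
    A = np.zeros((len(classes), len(clusters)), dtype=int)
    for c, k in zip(class_labels, cluster_labels):
        A[row[c], col[k]] += 1
    return A, classes, clusters

# Example: three points with true classes "a", "a", "b"
# assigned to clusters 0, 1, 1.
A, classes, clusters = class_cluster_matrix(["a", "a", "b"], [0, 1, 1])
```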
Wikipedia's definition: a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one; each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class (or vice versa).
The confusion matrix should be clear: it basically tells how many actual results match the predicted results. For example, consider this confusion matrix (rows are actual classes, columns are predicted classes):

                 predicted c1    predicted c2
    actual c1         15               3
    actual c2          0               2

It tells us that:
- Column 1, row 1: the classifier predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (which are correct predictions).
- Column 2, row 1: the classifier predicted 3 items as belonging to class c2, but they actually belong to class c1 (which are wrong predictions).
- Column 1, row 2: none of the items that actually belong to class c2 were predicted to belong to class c1 (this cell counts wrong predictions, and here it is zero).
- Column 2, row 2: 2 items that actually belong to class c2 were predicted to belong to class c2 (which are correct predictions).

Now see the formulas for accuracy and error rate in your book (Chapter 4, Section 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier using data with known results. The k-fold method (also mentioned in the book) is one of the ways to estimate a classifier's accuracy.
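As a quick sanity check, both quantities can be computed directly from the example matrix above; this is just a sketch, not code from the book:

```python
import numpy as np

# The example confusion matrix: rows = actual class, columns = predicted class.
cm = np.array([[15, 3],
               [0,  2]])

correct = np.trace(cm)        # diagonal entries are correct predictions: 15 + 2 = 17
total = cm.sum()              # all predictions: 20
accuracy = correct / total    # 17 / 20 = 0.85
error_rate = 1 - accuracy     #  3 / 20 = 0.15
```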
Now, for the contingency table:

Wikipedia's definition: a contingency table (also known as a cross tabulation) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables.

In data mining, contingency tables are used to show which items appear together in a record, such as in a transaction or in a shopping cart in sales analysis. For example (this is the example from the book you mentioned): a survey of 1000 respondents, cross-tabulated by whether they like coffee, tea, both, or neither.
Contingency tables are used to find the support and confidence of association rules, basically to evaluate association rules (read Chapter 6, Section 6.7.1).
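To make support and confidence concrete, here is a small sketch for a rule like {tea} -> {coffee}; the counts below are illustrative stand-ins, not necessarily the exact numbers from the book's table:

```python
# Hypothetical 2x2 contingency table from a survey of 1000 respondents:
#                     coffee   no coffee
tea_row    = (150,  50)    # respondents who like tea
no_tea_row = (650, 150)    # respondents who do not like tea

total = sum(tea_row) + sum(no_tea_row)    # 1000 responses in all
support = tea_row[0] / total              # P(tea and coffee) = 150/1000 = 0.15
confidence = tea_row[0] / sum(tea_row)    # P(coffee | tea)   = 150/200  = 0.75
```

Note that with these illustrative counts, even though the confidence looks high, comparing it with the overall fraction of coffee drinkers (800/1000 = 0.80) shows why the whole table, not a single cell, is needed to judge a rule.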
The difference, then, is that a confusion matrix is used to evaluate the performance of a classifier (it tells how accurate the classifier's predictions are), while a contingency table is used to evaluate association rules.
Now, after reading this answer, google a bit (always use Google while reading your book), read what is in the book, look at a few examples, and don't forget to solve some of the exercises given in the book. Then you should have a clear concept of both, and of what to use in a given situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.