What is the difference between a confusion matrix and a contingency table?
I'm writing a piece of code to evaluate my clustering algorithm, and I find that every evaluation method needs the basic data from an m×n matrix A = {a_ij}, where a_ij is the number of data points that are members of class c_i and elements of cluster k_j.

But there appear to be two matrices of this type in Introduction to Data Mining (Pang-Ning Tan et al.): one is the confusion matrix, the other is the contingency table. I do not fully understand the difference between the two. Which best describes the matrix I want to use?
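For concreteness, here is a minimal sketch of how such a matrix can be built, assuming the class and cluster assignments are available as two parallel label sequences (the function name and representation are my own, not from the book):

```python
import numpy as np

def class_cluster_matrix(class_labels, cluster_labels):
    """Return (A, classes, clusters) where A[i, j] counts the data
    points that are members of class c_i and elements of cluster k_j."""
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_labels))
    row = {c: i for i, c in enumerate(classes)}
    col = {k: j for j, k in enumerate(clusters)}
    A = np.zeros((len(classes), len(clusters)), dtype=int)
    for c, k in zip(class_labels, cluster_labels):
        A[row[c], col[k]] += 1
    return A, classes, clusters

# Example: three points with true classes "a", "a", "b"
# assigned to clusters 0, 1, 1.
A, classes, clusters = class_cluster_matrix(["a", "a", "b"], [0, 1, 1])
```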
Wikipedia's definition: a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one; each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class (or vice versa).
The confusion matrix should be clear: it basically tells how many actual results match the predicted results. For example, consider this confusion matrix (rows are actual classes, columns are predicted classes):

                 predicted c1    predicted c2
    actual c1         15               3
    actual c2          0               2

It tells us that:
- Column 1, row 1: the classifier predicted 15 items as belonging to class c1, and those 15 items actually belong to class c1 (which are correct predictions).
- Column 2, row 1: the classifier predicted 3 items as belonging to class c2, but they actually belong to class c1 (which are wrong predictions).
- Column 1, row 2: none of the items that actually belong to class c2 were predicted to belong to class c1 (this cell counts wrong predictions, and here it is zero).
- Column 2, row 2: 2 items that actually belong to class c2 were predicted to belong to class c2 (which are correct predictions).

Now see the formulas for accuracy and error rate in your book (Chapter 4, Section 4.2), and you should be able to clearly understand what a confusion matrix is. It is used to test the accuracy of a classifier using data with known results. The k-fold method (also mentioned in the book) is one of the ways to estimate a classifier's accuracy.
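As a quick sanity check, both quantities can be computed directly from the example matrix above; this is just a sketch, not code from the book:

```python
import numpy as np

# The example confusion matrix: rows = actual class, columns = predicted class.
cm = np.array([[15, 3],
               [0,  2]])

correct = np.trace(cm)        # diagonal entries are correct predictions: 15 + 2 = 17
total = cm.sum()              # all predictions: 20
accuracy = correct / total    # 17 / 20 = 0.85
error_rate = 1 - accuracy     #  3 / 20 = 0.15
```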
Now, for the contingency table:

Wikipedia's definition: a contingency table (also known as a cross tabulation) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables.

In data mining, contingency tables are used to show which items appear together in a record, such as in a transaction or in a shopping cart in sales analysis. For example (this is the example from the book you mentioned): a survey of 1000 respondents, cross-tabulated by whether they like coffee, tea, both, or neither.
Contingency tables are used to find the support and confidence of association rules, basically to evaluate association rules (read Chapter 6, Section 6.7.1).
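To make support and confidence concrete, here is a small sketch for a rule like {tea} -> {coffee}; the counts below are illustrative stand-ins, not necessarily the exact numbers from the book's table:

```python
# Hypothetical 2x2 contingency table from a survey of 1000 respondents:
#                     coffee   no coffee
tea_row    = (150,  50)    # respondents who like tea
no_tea_row = (650, 150)    # respondents who do not like tea

total = sum(tea_row) + sum(no_tea_row)    # 1000 responses in all
support = tea_row[0] / total              # P(tea and coffee) = 150/1000 = 0.15
confidence = tea_row[0] / sum(tea_row)    # P(coffee | tea)   = 150/200  = 0.75
```

Note that with these illustrative counts, even though the confidence looks high, comparing it with the overall fraction of coffee drinkers (800/1000 = 0.80) shows why the whole table, not a single cell, is needed to judge a rule.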
The difference, then, is that a confusion matrix is used to evaluate the performance of a classifier (it tells how accurate the classifier's predictions are), while a contingency table is used to evaluate association rules.
Now, after reading this answer, google a bit (always use Google while reading your book), read what is in the book, look at a few examples, and don't forget to solve some of the exercises given in the book. Then you should have a clear concept of both, and of what to use in a given situation and why.
Hope this helps.
In short, a contingency table is used to describe data, and a confusion matrix is, as others have pointed out, often used when comparing two hypotheses. One can think of predicted vs. actual classification/categorization as two hypotheses, with the ground truth being the null and the model output being the alternative.