Using frequent itemset mining to build association rules?

Posted on 2024-11-29 22:04:07


I am new to this area and its terminology, so please feel free to correct me if I go wrong somewhere. I have two datasets like this:

Dataset 1:

A B C 0 E
A 0 C 0 0
A 0 C D E
A 0 C 0 E

The way I interpret this is that at some point in time (A,B,C,E) occurred together, and so did (A,C), (A,C,D,E), etc.

Dataset 2:

5A 1B 5C  0 2E
4A  0 5C  0  0
2A  0 1C 4D 4E
3A  0 4C  0 3E

The way I interpret this is that at some point in time there were 5 occurrences of A, 1 occurrence of B, 5 occurrences of C, and 2 occurrences of E, and so on.

I am trying to find which items occur together and, if possible, the cause and effect behind this. I do not understand how to use both datasets (or whether one is enough). A good tutorial would help, but my primary question is which dataset to use and how to proceed with (i) building frequent itemsets and (ii) building association rules between them.

Can someone point me to a practical tutorial or example (preferably in Python), or at least briefly explain how to approach this problem?


孤蝉 2024-12-06 22:04:07



Some theoretical facts about association rules:

Association rule mining is a type of undirected data mining that finds patterns in data where the target is not specified beforehand. Whether the patterns make sense is left to human interpretation.
The goal of association rules is to detect relationships or associations between specific values of categorical variables in large data sets.
A rule can be interpreted as, for example, "70% of the customers who buy wine and cheese also buy grapes".
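
To make a statement like the one above concrete, here is a minimal Python 3 sketch that computes support and confidence for Dataset 1 from the question. The helper names `support` and `confidence` are my own illustration, not part of any library, and each row is treated as a plain set with the zeros dropped.

```python
# Dataset 1 from the question, one set of items per row (zeros dropped).
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C"},
    {"A", "C", "D", "E"},
    {"A", "C", "E"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"A", "C"}, transactions))            # 1.0
print(support({"A", "C", "E"}, transactions))       # 0.75
print(confidence({"A", "C"}, {"E"}, transactions))  # 0.75
```

A confidence of 0.75 for (A, C) -> E reads the same way as the wine-and-cheese example: 75% of the rows that contain A and C also contain E.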

To find association rules, you can use the Apriori algorithm. Many Python implementations already exist, although most of them are not efficient enough for practical use.

Alternatively, use the Orange data mining library, which has good support for association rules.

Usage example:

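'''
Save the first example (Dataset 1) as item.basket, one transaction per line:
A, B, C, E
A, C
A, C, D, E
A, C, E
Start IPython in the directory containing the file, or change into it with os:
>>> import os
>>> os.chdir("c:/orange")
'''
# Note: this uses the legacy Orange 2.x API (the `orange` module).
import orange

# Load the transactions from item.basket.
items = orange.ExampleTable("item")
# Adjust the support threshold to filter out rules.
rules = orange.AssociationRulesSparseInducer(items, support=0.1)
for r in rules:
    # Print each rule with its support and confidence.
    print("%5.3f %5.3f %s" % (r.support, r.confidence, r))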
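
If Orange is not available, the same idea can be sketched in plain Python 3. The following is a minimal, unoptimized illustration of Apriori on Dataset 1, not the algorithm as Orange implements it. The function names `apriori` and `generate_rules` and the thresholds 0.5 and 0.8 are assumptions chosen for the example.

```python
from itertools import combinations

# Dataset 1 from the question, one frozenset of items per row.
transactions = [
    frozenset("ABCE"),
    frozenset("AC"),
    frozenset("ACDE"),
    frozenset("ACE"),
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}

    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so the lookup is safe.
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, supp, conf))
    return rules


frequent = apriori(transactions, min_support=0.5)
for antecedent, consequent, supp, conf in generate_rules(frequent, min_confidence=0.8):
    print("%5.3f %5.3f %s -> %s" % (supp, conf, sorted(antecedent), sorted(consequent)))
```

Only the presence or absence of an item matters in this sketch, so Dataset 1 is enough to run it. The rules it prints describe co-occurrence only and say nothing by themselves about cause and effect.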

To learn more about association rules and frequent itemset mining, my selection of books is: