Using frequent itemset mining to build association rules?

Posted on 2024-11-29 22:04:07

I am new to this area and its terminology, so please feel free to correct me if I go wrong somewhere. I have two datasets like this:

Dataset 1:

A B C 0 E
A 0 C 0 0
A 0 C D E
A 0 C 0 E

The way I interpret this is that at some point in time, (A,B,C,E) occurred together, and so did (A,C), (A,C,D,E), etc.

Dataset 2:

5A 1B 5C  0 2E
4A  0 5C  0  0
2A  0 1C 4D 4E
3A  0 4C  0 3E

The way I interpret this is that at some point in time, 5 occurrences of A, 1 occurrence of B, 5 occurrences of C and 2 occurrences of E happened, and so on.

I am trying to find which items occur together and, if possible, also the cause and effect behind this. I do not understand how to go about using the two datasets (or whether one is enough). A good tutorial would be welcome, but my primary question is which dataset to use and how to proceed in (i) building frequent itemsets and (ii) building association rules between them.

Can someone point me to a practical tutorial or example (preferably in Python), or at least briefly explain how to approach this problem?

Comments (3)

孤蝉 2024-12-06 22:04:07

Some theoretical facts about association rules:

  • Association rule mining is a type of undirected data mining that finds patterns in data where the target is not specified beforehand. Whether the patterns make sense is left to human interpretation.
  • The goal of association rules is to detect relationships or associations between specific values of categorical variables in large datasets.
  • A rule can be interpreted as, for example, "70% of the customers who buy wine and cheese also buy grapes".
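To make the "70% of customers" reading concrete, here is a minimal sketch of the two standard measures, support and confidence, computed on Dataset 1 from the question. The helper names `support` and `confidence` are illustrative, not taken from any library.

```python
# Dataset 1 from the question, one set of items per transaction (zeros dropped).
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C"},
    {"A", "C", "D", "E"},
    {"A", "C", "E"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# {A, C, E} appears in 3 of the 4 transactions.
print(support({"A", "C", "E"}, transactions))       # 0.75
# Of the 4 transactions containing A and C, 3 also contain E.
print(confidence({"A", "C"}, {"E"}, transactions))  # 0.75
```

Read this way, the rule {A, C} → {E} says "75% of the transactions that contain A and C also contain E".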

To find association rules, you can use the Apriori algorithm. Many Python implementations already exist, although most of them are not efficient enough for practical use:

or use the Orange data mining library, which has good support for association rules.

Usage example:

'''
Save the first example as item.basket in this format:
A, B, C, E
A, C
A, C, D, E
A, C, E
Open IPython in the same directory as the saved file, or use the os module:
>>> import os
>>> os.chdir("c:/orange")
'''
# Legacy Orange 2.x API (the `orange` module)
import orange

# Load item.basket as a sparse table of transactions
items = orange.ExampleTable("item")
# Play with the support argument to filter out rules
rules = orange.AssociationRulesSparseInducer(items, support=0.1)
for r in rules:
    # Print support, confidence, and the rule itself
    print("%5.3f %5.3f %s" % (r.support, r.confidence, r))
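If Orange is not available, the two steps the question asks about, (i) frequent itemsets and (ii) rules, can be sketched in plain Python. This is a small, unoptimized illustration of the Apriori idea on Dataset 1, not a reference implementation; the function names and the thresholds 0.5 and 0.8 are arbitrary choices for the example.

```python
from itertools import combinations

# Dataset 1 from the question, zeros dropped.
transactions = [
    frozenset("ABCE"),
    frozenset("AC"),
    frozenset("ACDE"),
    frozenset("ACE"),
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def support(candidate):
        return sum(candidate <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so it is in the dict.
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


frequent = apriori(transactions, min_support=0.5)
for ante, cons, sup, conf in association_rules(frequent, min_confidence=0.8):
    print("%5.3f %5.3f %s -> %s" % (sup, conf, sorted(ante), sorted(cons)))
```

Note that this works on the presence/absence view of the data (Dataset 1); the quantities in Dataset 2 are not used.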

To learn more about association rules and frequent itemset mining, my selection of books is:

There is no shortcut.

篱下浅笙歌 2024-12-06 22:04:07

A neat way to handle this type of problem seems to be a Bayesian network, in particular treating it as a Bayesian network structure learning problem. Once you have the network, you will be able to efficiently answer questions like p(A=1 | B=0 and C=1), and so on.
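To show what kind of query is meant, the sketch below estimates such a conditional probability directly by counting rows of Dataset 1. This is only the empirical quantity, not a Bayesian network and not structure learning; a learned network would answer the same kind of query from a factored model instead of raw counts. The function name `conditional_probability` is illustrative.

```python
# Dataset 1 as binary rows (1 = item present), columns A..E.
columns = ["A", "B", "C", "D", "E"]
rows = [
    (1, 1, 1, 0, 1),
    (1, 0, 1, 0, 0),
    (1, 0, 1, 1, 1),
    (1, 0, 1, 0, 1),
]
records = [dict(zip(columns, r)) for r in rows]


def conditional_probability(records, target, given):
    """Empirical P(target | given); both are dicts of {variable: value}.

    Returns None when no record matches `given`.
    """
    matching = [r for r in records if all(r[k] == v for k, v in given.items())]
    if not matching:
        return None
    hits = sum(all(r[k] == v for k, v in target.items()) for r in matching)
    return hits / len(matching)


# Three rows have B=0 and C=1; all three have A=1, two of them have E=1.
print(conditional_probability(records, {"A": 1}, {"B": 0, "C": 1}))
print(conditional_probability(records, {"E": 1}, {"B": 0, "C": 1}))
```

With only four rows the estimates are of course very rough; they are shown here only to illustrate the form of the question.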

梦里的微风 2024-12-06 22:04:07

If you have quantities for each item, then you could consider "high utility itemset mining". It is the itemset mining problem adapted to the case where items can have quantities in each transaction and each item can also have a weight.

If you just use the basic Apriori, then you would lose the information about quantities.
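As a rough illustration of the idea on Dataset 2, the sketch below defines the utility of an itemset as the sum of quantity × weight over the transactions that contain all of its items, and lists the itemsets above a threshold by brute force. The weights are hypothetical (the question gives none, so each is set to 1), and real high utility mining algorithms avoid this exhaustive enumeration.

```python
from itertools import combinations

# Dataset 2 from the question: quantity of each item per transaction (zeros omitted).
transactions = [
    {"A": 5, "B": 1, "C": 5, "E": 2},
    {"A": 4, "C": 5},
    {"A": 2, "C": 1, "D": 4, "E": 4},
    {"A": 3, "C": 4, "E": 3},
]

# Hypothetical per-item weights (e.g. unit profit); not given in the question.
weights = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}


def utility(itemset, transactions, weights):
    """Sum of quantity * weight over transactions containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * weights[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, weights, min_utility):
    """Brute-force enumeration; exponential in the number of distinct items."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, transactions, weights)
            if u >= min_utility:
                result[combo] = u
    return result


print(utility(("A", "C"), transactions, weights))  # (5+5) + (4+5) + (2+1) + (3+4) = 29
print(high_utility_itemsets(transactions, weights, min_utility=29))
```

One thing the example brings out: ("A", "C", "E") has utility 29 while its subset ("A", "E") has only 19, so utility does not shrink monotonically as itemsets grow, and the Apriori-style pruning of supersets cannot be applied to it directly.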
