Using frequent itemset mining to build association rules?
I am new to this area and its terminology, so please correct me if I go wrong somewhere. I have two datasets like this:
Dataset 1:
A B C 0 E
A 0 C 0 0
A 0 C D E
A 0 C 0 E
The way I interpret this is that at some point in time (A,B,C,E) occurred together, and so did (A,C), (A,C,D,E), etc.
Dataset 2:
5A 1B 5C 0 2E
4A 0 5C 0 0
2A 0 1C 4D 4E
3A 0 4C 0 3E
The way I interpret this is that at some point in time there were 5 occurrences of A, 1 occurrence of B, 5 occurrences of C and 2 occurrences of E, and so on.
I am trying to find which items occur together and, if possible, the cause and effect behind this. I do not understand how to use both datasets (or whether one is enough). A good tutorial would help, but my primary question is which dataset to use and how to proceed with (i) building frequent itemsets and (ii) building association rules between them.
Can someone point me to practical tutorials/examples (preferably in Python), or at least briefly explain how to approach this problem?
Comments (3)
Some theoretical facts about association rules:
To find association rules, you can use the Apriori algorithm. Many Python implementations already exist, although most of them are not efficient for practical use:
or use the Orange data mining library, which has good support for association rules.
Usage example:
To learn more about association rules and frequent itemset mining, my selection of books is:
There is no shortcut.
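To make the Apriori idea concrete, here is a minimal pure-Python sketch run on Dataset 1 from the question (with the `0` placeholders dropped). It is an illustration only, not the Orange API or any of the implementations mentioned above; the function names `support`, `apriori` and `rules` and the thresholds are assumptions chosen for this example.

```python
from itertools import combinations

# Dataset 1 from the question; each row becomes a set of the items present.
transactions = [
    {"A", "B", "C", "E"},
    {"A", "C"},
    {"A", "C", "D", "E"},
    {"A", "C", "E"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent enough.
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for each strong rule."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # confidence(lhs -> rest) = support(itemset) / support(lhs)
                conf = sup / frequent[lhs]
                if conf >= min_confidence:
                    yield lhs, itemset - lhs, conf

frequent = apriori(transactions, min_support=0.75)
for lhs, rhs, conf in rules(frequent, min_confidence=1.0):
    print(sorted(lhs), "->", sorted(rhs), f"confidence={conf:.2f}")
```

Note that this uses only Dataset 1: basic Apriori looks at whether an item is present in a transaction, not how many times. A rule found this way describes co-occurrence, not cause and effect.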
A neat way to handle this type of problem seems to be a Bayesian network, in particular treating it as a Bayesian network structure learning problem. Once you have the network, you will be able to efficiently answer questions like p(A=1 | B=0 and C=1), and so on.
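To show what such a query means on the data, the sketch below estimates p(A=1 | B=0 and C=1) on Dataset 1 by plain counting. This is not a Bayesian network and involves no structure learning; it only illustrates the kind of conditional question a learned network would answer from its factorized distribution. The helper name `conditional_probability` is an assumption made for this example.

```python
# Dataset 1 as binary rows: 1 means the item is present, 0 means absent.
rows = [
    {"A": 1, "B": 1, "C": 1, "D": 0, "E": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0},
    {"A": 1, "B": 0, "C": 1, "D": 1, "E": 1},
    {"A": 1, "B": 0, "C": 1, "D": 0, "E": 1},
]

def conditional_probability(rows, target, evidence):
    """Estimate p(target | evidence) by counting matching rows.

    Returns None when the evidence never occurs in the data,
    because the estimate is undefined in that case.
    """
    matching = [r for r in rows if all(r[k] == v for k, v in evidence.items())]
    if not matching:
        return None
    hits = sum(all(r[k] == v for k, v in target.items()) for r in matching)
    return hits / len(matching)

print(conditional_probability(rows, {"A": 1}, {"B": 0, "C": 1}))
print(conditional_probability(rows, {"E": 1}, {"B": 0, "C": 1}))
```

With only four rows these estimates are very noisy, and direct counting breaks down as the number of variables grows, which is the situation where a learned network structure becomes useful.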
If you have quantities for each item, then you could consider "high utility itemset mining". It is the itemset mining problem adapted to the case where items can have quantities in each transaction and each item can also have a weight.
If you just use basic Apriori, you would lose the information about quantities.
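As a rough sketch of the idea, the code below computes itemset utilities on Dataset 2 from the question. The `weights` values are hypothetical (the question gives quantities but no weights), and the brute-force enumeration is for illustration only. Utility is not anti-monotone the way support is, so dedicated algorithms rely on upper bounds such as transaction-weighted utility to prune the search instead of enumerating every combination.

```python
from itertools import combinations

# Dataset 2 from the question: item -> quantity in each transaction.
transactions = [
    {"A": 5, "B": 1, "C": 5, "E": 2},
    {"A": 4, "C": 5},
    {"A": 2, "C": 1, "D": 4, "E": 4},
    {"A": 3, "C": 4, "E": 3},
]

# Hypothetical per-item weights (e.g. unit profit); not given in the question.
weights = {"A": 1, "B": 5, "C": 1, "D": 3, "E": 2}

def utility(itemset, transactions, weights):
    """Sum of quantity * weight over all transactions containing the itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * weights[item] for item in itemset)
    return total

def high_utility_itemsets(transactions, weights, min_utility):
    """Brute-force search: return {itemset tuple: utility} above the threshold."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, transactions, weights)
            if u >= min_utility:
                result[combo] = u
    return result

for itemset, u in high_utility_itemsets(transactions, weights, 30).items():
    print(itemset, u)
```

Unlike the support-based approach, an itemset that appears in a single transaction can still rank highly here if its quantities and weights are large.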