Apriori Algorithm
I've heard about the Apriori algorithm several times before, but I never had the time or the opportunity to dig into it. Can anyone explain to me, in a simple way, how this algorithm works? Also, a basic example would make it a lot easier for me to understand.
The best introduction to Apriori can be downloaded from this book:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
You can download Chapter 6 for free, which explains Apriori very clearly.
Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/
Well, I would assume you've read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that, so I'll assume you haven't read it and suggest that you do.

Read the Wikipedia article.
Apriori Algorithm
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

Apriori pruning principle - if any itemset is infrequent, then its supersets should not be generated/tested.

Apriori property - a given `(k+1)-itemset` is a candidate `(k+1)-itemset` only if every one of its `k-itemset` subsets is frequent.

Now, here is the Apriori algorithm in 4 steps.
1. Initially, scan the DB once to get the frequent `1-itemset`.
2. Generate length `k+1` candidate itemsets from the length `k` frequent itemsets.
3. Test the candidates against the DB.
4. Terminate when no frequent or candidate itemsets can be generated.

Solved Example
Suppose there is a transaction database with 4 transactions, including their transaction IDs and the items bought in them. Assume the minimum support, `min_sup`, is `2`. The term support means the number of transactions in which a certain itemset is present/included.

Transaction DB
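The transaction table in the original answer is an image, so here is a sketch of such a database in Python. The dataset itself is an assumption on my part: it is the classic four-transaction example, chosen because it is consistent with every support count discussed in this answer. A small helper counts an itemset's support:

```python
# Assumed transaction DB: the classic four-transaction example.
# The exact table from the answer is an image, so this dataset is an
# assumption that matches the support counts discussed in the text.
transactions = [
    {"A", "C", "D"},       # T1
    {"B", "C", "E"},       # T2
    {"A", "B", "C", "E"},  # T3
    {"B", "E"},            # T4
]

def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"D"}, transactions))            # 1 -> below min_sup = 2
print(support({"B", "C", "E"}, transactions))  # 2 -> frequent
```

Here `itemset <= t` is Python's subset test for sets.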
Now, let's create the candidate `1-itemsets` by the 1st scan of the DB; we simply call this the set `C_1`. If we test these candidates against `min_sup`, we can see that `{D}` does not satisfy the `min_sup` of `2`. So it will not be included in the frequent `1-itemset`, which we simply call the set `L_1`.

Now, let's scan the DB a 2nd time and generate the candidate `2-itemsets`, which we simply call the set `C_2`. As you can see, the itemsets `{A,B}` and `{A,E}` do not satisfy the `min_sup` of `2`, and hence they will not be included in the frequent `2-itemset`, `L_2`.

Now let's do a 3rd scan of the DB and get the candidate `3-itemsets`, `C_3`. You can see that `{A,B,C}`, `{A,B,E}` and `{A,C,E}` do not satisfy the `min_sup` of `2`, so they will not be included in the frequent `3-itemset`, `L_3`.

Now, finally, we can calculate the `support (supp)`, `confidence (conf)` and `lift (interestingness value)` of the association/correlation rules that can be generated from the itemset `{B,C,E}`.
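The whole walk-through can be condensed into a short Python sketch. Two assumptions: the dataset is again the classic four-transaction DB (the answer's tables are images), and the candidate-generation step applies the Apriori pruning principle, so `{A,B,C}`, `{A,B,E}` and `{A,C,E}` are pruned before the 3rd scan rather than tested against the DB; either way, the surviving frequent itemsets are the same:

```python
from itertools import combinations

# Assumed dataset: the classic four-transaction example (the original
# answer's tables were images and are not reproduced here).
transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent 1-itemsets (L_1) from the first DB scan.
items = sorted(set().union(*transactions))
L = [{frozenset([i]) for i in items if support({i}) >= min_sup}]

# Steps 2-4: generate C_{k+1} from L_k, test against the DB, repeat
# until no further frequent itemsets can be produced.
while L[-1]:
    prev = L[-1]
    # Candidate generation: join pairs of frequent k-itemsets ...
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    # ... with Apriori pruning: every k-subset must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    L.append({c for c in candidates if support(c) >= min_sup})

for k, level in enumerate(L[:-1], start=1):
    print(f"L_{k}:", sorted(map(sorted, level)))

# Rule metrics for one rule from {B,C,E}, here {B,C} -> {E}:
n = len(transactions)
X, Y = frozenset({"B", "C"}), frozenset({"E"})
supp = support(X | Y) / n            # 0.5
conf = support(X | Y) / support(X)   # 1.0
lift = conf / (support(Y) / n)       # ~1.33
print(supp, conf, lift)
```

The other rules generated from `{B,C,E}` (for example `{B,E} -> {C}` or `{E} -> {B,C}`) follow the same three formulas with different choices of `X` and `Y`.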
See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.