Apriori Algorithm

Posted on 2024-07-30 19:21:04

I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.

Comments (4)

玻璃人 2024-08-06 19:21:07

The best introduction to Apriori is in this book:

http://www-users.cs.umn.edu/~kumar/dmbook/index.php

You can download Chapter 6 for free; it explains Apriori very clearly.

Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:

http://www.philippe-fournier-viger.com/spmf/

2024-08-06 19:21:07

Well, I would have assumed you'd read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has exactly that, so I'll assume you haven't read it and suggest that you do.

Read the Wikipedia article.

一场春暖 2024-08-06 19:21:06

Apriori Algorithm

It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

Apriori Pruning Principle - If any itemset is infrequent, its supersets need not be generated or tested.

Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if every one of its k-itemset subsets is frequent.

Now, here is the Apriori algorithm in 4 steps.

  1. Initially, scan the database/dataset once to get the frequent 1-itemset.
  2. Generate length k+1 candidate itemsets from length k frequent itemsets.
  3. Test the candidates against the database/dataset.
  4. Terminate when no frequent or candidate set can be generated.
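The four steps above can be sketched in Python. This is a minimal illustration of the level-wise loop, not an optimized implementation; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    frequent = {}
    # Step 1: the candidate 1-itemsets are all single items seen in the DB.
    current = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while current:
        # Step 3: test candidates against the DB (one scan per level).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Step 2: join frequent k-itemsets into (k+1)-candidates, keeping
        # only those whose k-subsets are all frequent (Apriori property).
        current = set()
        for a, b in combinations(list(level), 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(cand, k)
            ):
                current.add(cand)
        k += 1  # Step 4: the loop ends when no candidates remain.
    return frequent

transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
freq = apriori(transactions, min_sup=2)
```

Running this on the example database below yields the nine frequent itemsets derived step by step in the rest of the answer.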

Solved Example

Suppose there is a transaction database as follows, with 4 transactions including their transaction IDs and the items bought in them. Assume the minimum support, min_sup, is 2. The term support means the number of transactions in which a certain itemset is present/included.

Transaction DB

tid  | items
-------------
10   | A,C,D
20   | B,C,E
30   | A,B,C,E
40   | B,E

Now, let's create the candidate 1-itemsets by the 1st scan of the DB. We simply call this set C_1, shown as follows.

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {D}   |  1
  {E}   |  3
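For reference, these C_1 counts fall out of a single pass over the transactions (a small sketch; the variable names are illustrative):

```python
from collections import Counter

transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]

# One scan: count every single item across all transactions.
c1 = Counter(item for t in transactions for item in t)
# c1 counts: A=2, B=3, C=3, D=1, E=3
```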

If we test this with min_sup, we can see that {D} does not satisfy the min_sup of 2. So, it will not be included in the frequent 1-itemset, which we simply call the set L_1, as follows.

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {E}   |  3

Now, let's scan the DB for the 2nd time and generate the candidate 2-itemsets, which we simply call the set C_2, as follows.

itemset | sup
-------------
  {A,B} |  1
  {A,C} |  2
  {A,E} |  1
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2

As you can see, the {A,B} and {A,E} itemsets do not satisfy the min_sup of 2, and hence they will not be included in the frequent 2-itemset, L_2, as follows.

itemset | sup
-------------
  {A,C} |  2
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2
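The join-and-test step that takes L_1 to L_2 can be sketched like this (names are illustrative; the join simply pairs up frequent single items):

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2
l1 = ["A", "B", "C", "E"]  # frequent single items from the 1st scan

# Join step: every 2-item combination of frequent single items is a candidate.
c2 = [set(pair) for pair in combinations(l1, 2)]

# Test step (2nd DB scan): keep only candidates meeting min_sup.
l2 = {}
for cand in c2:
    sup = sum(1 for t in transactions if cand <= t)
    if sup >= min_sup:
        l2[frozenset(cand)] = sup
# {A,B} and {A,E} are dropped; AC, BC, BE, CE survive
```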

Now let's do a 3rd scan of the DB and get candidate 3-itemsets, C_3 as follows.

itemset  | sup
-------------
 {A,B,C} |  1
 {A,B,E} |  1
 {A,C,E} |  1
 {B,C,E} |  2

You can see that {A,B,C}, {A,B,E} and {A,C,E} do not satisfy the min_sup of 2. So they will not be included in the frequent 3-itemset, L_3, as follows.

itemset  | sup
-------------
 {B,C,E} |  2
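In fact, the Apriori property could have pruned {A,B,C}, {A,B,E} and {A,C,E} before the 3rd scan even counted them, since each contains an infrequent 2-subset ({A,B} or {A,E}). A small sketch of that pruning check (names are illustrative):

```python
from itertools import combinations

l2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]  # frequent 2-itemsets

def survives_pruning(candidate):
    # Every 2-subset of a candidate 3-itemset must itself be frequent.
    return all(set(pair) in l2 for pair in combinations(sorted(candidate), 2))

candidates = [{"A", "B", "C"}, {"A", "B", "E"},
              {"A", "C", "E"}, {"B", "C", "E"}]
kept = [c for c in candidates if survives_pruning(c)]
# only {B,C,E} survives the pruning
```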

Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness value) of the association/correlation rules that can be generated from the itemset {B,C,E}, as follows.

         Rule       | supp  |  conf   |  lift
    -------------------------------------------
     B ->  C & E    |  50%  |  66.67% |  1.33
     E ->  B & C    |  50%  |  66.67% |  1.33
     C ->  E & B    |  50%  |  66.67% |  0.89
     B & C ->  E    |  50%  |  100%   |  1.33
     E & B ->  C    |  50%  |  66.67% |  0.89
     C & E ->  B    |  50%  |  100%   |  1.33

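
These figures follow directly from the level counts: supp(X -> Y) = P(X and Y), conf = supp / P(X), and lift = conf / P(Y). A quick check (names are illustrative):

```python
transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

def metrics(antecedent, consequent):
    both = frozenset(antecedent) | frozenset(consequent)
    supp = support(both)                           # P(X and Y)
    conf = supp / support(frozenset(antecedent))   # P(Y | X)
    lift = conf / support(frozenset(consequent))   # conf / P(Y)
    return supp, conf, lift

supp, conf, lift = metrics("B", "CE")   # the rule B -> C & E
# supp = 0.5, conf ~ 0.667, lift ~ 1.33
```
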
等待圉鍢 2024-08-06 19:21:06

See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.
