Apriori Algorithm

Posted on 2024-07-30 19:21:04

I've heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings of this algorithm? Also, a basic example would make it a lot easier for me to understand.

Comments (4)

玻璃人 2024-08-06 19:21:07

The best introduction to Apriori is in this book:

http://www-users.cs.umn.edu/~kumar/dmbook/index.php

You can download Chapter 6 for free; it explains Apriori very clearly.

Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:

http://www.philippe-fournier-viger.com/spmf/

2024-08-06 19:21:07

Well, I would have assumed you'd read the Wikipedia entry, but you said "a basic example would make it a lot easier for me to understand". Wikipedia has exactly that, so I'll assume you haven't read it and suggest that you do.

Read the Wikipedia article.

一场春暖 2024-08-06 19:21:06

Apriori Algorithm

It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

Apriori Pruning Principle - If any itemset is infrequent, its supersets need not be generated or tested.

Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if every one of its k-itemset subsets is frequent.

Now, here is the Apriori algorithm in 4 steps.

  1. Initially, scan the database/dataset once to get the frequent 1-itemset.
  2. Generate length k+1 candidate itemsets from length k frequent itemsets.
  3. Test the candidates against the database/dataset.
  4. Terminate when no frequent or candidate set can be generated.
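The four steps above can be sketched in Python. This is a minimal illustration of the level-wise loop, not an optimized implementation; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    frequent = {}
    # Step 1: the candidate 1-itemsets are all single items seen in the DB.
    current = {frozenset([item]) for t in transactions for item in t}
    k = 1
    while current:
        # Step 3: test candidates against the DB (one scan per level).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Step 2: join frequent k-itemsets into (k+1)-candidates, keeping
        # only those whose k-subsets are all frequent (Apriori property).
        current = set()
        for a, b in combinations(list(level), 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(cand, k)
            ):
                current.add(cand)
        k += 1  # Step 4: the loop ends when no candidates remain.
    return frequent

transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
freq = apriori(transactions, min_sup=2)
```

Running this on the example database below yields the nine frequent itemsets derived step by step in the rest of the answer.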

Solved Example

Suppose there is a transaction database as follows, with 4 transactions including their transaction IDs and the items bought in them. Assume the minimum support, min_sup, is 2. The term support means the number of transactions in which a certain itemset is present/included.

Transaction DB

tid  | items
-------------
10   | A,C,D
20   | B,C,E
30   | A,B,C,E
40   | B,E

Now, let's create the candidate 1-itemsets by the 1st scan of the DB. We simply call this set C_1, shown as follows.

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {D}   |  1
  {E}   |  3
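For reference, these C_1 counts fall out of a single pass over the transactions (a small sketch; the variable names are illustrative):

```python
from collections import Counter

transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]

# One scan: count every single item across all transactions.
c1 = Counter(item for t in transactions for item in t)
# c1 counts: A=2, B=3, C=3, D=1, E=3
```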

If we test this with min_sup, we can see that {D} does not satisfy the min_sup of 2. So, it will not be included in the frequent 1-itemset, which we simply call the set L_1, as follows.

itemset | sup
-------------
  {A}   |  2
  {B}   |  3
  {C}   |  3
  {E}   |  3

Now, let's scan the DB for the 2nd time and generate the candidate 2-itemsets, which we simply call the set C_2, as follows.

itemset | sup
-------------
  {A,B} |  1
  {A,C} |  2
  {A,E} |  1
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2

As you can see, the {A,B} and {A,E} itemsets do not satisfy the min_sup of 2, and hence they will not be included in the frequent 2-itemset, L_2, as follows.

itemset | sup
-------------
  {A,C} |  2
  {B,C} |  2
  {B,E} |  3
  {C,E} |  2
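The join-and-test step that takes L_1 to L_2 can be sketched like this (names are illustrative; the join simply pairs up frequent single items):

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2
l1 = ["A", "B", "C", "E"]  # frequent single items from the 1st scan

# Join step: every 2-item combination of frequent single items is a candidate.
c2 = [set(pair) for pair in combinations(l1, 2)]

# Test step (2nd DB scan): keep only candidates meeting min_sup.
l2 = {}
for cand in c2:
    sup = sum(1 for t in transactions if cand <= t)
    if sup >= min_sup:
        l2[frozenset(cand)] = sup
# {A,B} and {A,E} are dropped; AC, BC, BE, CE survive
```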

Now let's do a 3rd scan of the DB and get candidate 3-itemsets, C_3 as follows.

itemset  | sup
-------------
 {A,B,C} |  1
 {A,B,E} |  1
 {A,C,E} |  1
 {B,C,E} |  2

You can see that {A,B,C}, {A,B,E} and {A,C,E} do not satisfy the min_sup of 2. So they will not be included in the frequent 3-itemset, L_3, as follows.

itemset  | sup
-------------
 {B,C,E} |  2
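In fact, the Apriori property could have pruned {A,B,C}, {A,B,E} and {A,C,E} before the 3rd scan even counted them, since each contains an infrequent 2-subset ({A,B} or {A,E}). A small sketch of that pruning check (names are illustrative):

```python
from itertools import combinations

l2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]  # frequent 2-itemsets

def survives_pruning(candidate):
    # Every 2-subset of a candidate 3-itemset must itself be frequent.
    return all(set(pair) in l2 for pair in combinations(sorted(candidate), 2))

candidates = [{"A", "B", "C"}, {"A", "B", "E"},
              {"A", "C", "E"}, {"B", "C", "E"}]
kept = [c for c in candidates if survives_pruning(c)]
# only {B,C,E} survives the pruning
```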

Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness value) of the association/correlation rules that can be generated from the itemset {B,C,E}, as follows.

         Rule       | supp  |  conf   |  lift
    -------------------------------------------
     B ->  C & E    |  50%  |  66.67% |  1.33
     E ->  B & C    |  50%  |  66.67% |  1.33
     C ->  E & B    |  50%  |  66.67% |  0.89
     B & C ->  E    |  50%  |  100%   |  1.33
     E & B ->  C    |  50%  |  66.67% |  0.89
     C & E ->  B    |  50%  |  100%   |  1.33

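
These figures follow directly from the level counts: supp(X -> Y) = P(X and Y), conf = supp / P(X), and lift = conf / P(Y). A quick check (names are illustrative):

```python
transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t) / n

def metrics(antecedent, consequent):
    both = frozenset(antecedent) | frozenset(consequent)
    supp = support(both)                           # P(X and Y)
    conf = supp / support(frozenset(antecedent))   # P(Y | X)
    lift = conf / support(frozenset(consequent))   # conf / P(Y)
    return supp, conf, lift

supp, conf, lift = metrics("B", "CE")   # the rule B -> C & E
# supp = 0.5, conf ~ 0.667, lift ~ 1.33
```
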
等待圉鍢 2024-08-06 19:21:06

See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.
