如何从大型事务数据文件中查找最大频繁项集

发布于 2024-08-30 20:10:50 字数 1238 浏览 2 评论 0原文

我的输入文件包含大量交易,例如

交易 ID 项目,

T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee, 
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice

我希望每个唯一项目都出现,例如

Item Name Count
Bread  9
Milk   8
Coffee 7
Juice  6

alt text http://www.ade-technologies.com/OrderFP_Tree.jpg

从那我现在想要一个 fp 树,通过遍历这棵树,我想要最大频繁项集,如下

方法的基本思想就是从下到上在每一“层”布置节点。 “层”的概念与树中层的常见概念不同。 “层”中的节点是指对应于同一项并且位于“头表”的链表中的节点。对于“层”中的节点,将使用NBN方法沿着链表从左到右布置节点。要使用 NBN 方法,将向有序 FP-Tree 中的每个节点添加两个额外字段。节点N的字段tag存储N是否为最大频繁项集的信息,字段count'存储左侧节点的支持度信息。

图中,第一个要处理的节点是“juice:2”。如果 min_sup 等于或小于 2,则“面包、牛奶、咖啡、果汁”是最大频繁项集。首先输出juice:2,并将“coffee:3”的字段标签设置为“false”(每个节点的字段标签初始为“true”)。接下来检查正确的四个项目集juice:1是否是juice:2的子集。如果项集一节点“juice:1”对应的是juice:2的子集,则设置该节点的字段标签为“false”。在接下来的过程中,当已处理节点的字段标记为FALSE时,我们可以省略相同标记后的节点。如果min_sup大于2,则检查右边的四个juice:1是否是juice:2的子集。如果第一个节点“juice:1”对应的项集是juice:2的子集,则将该节点的字段count'设置为前一个count'与2之和。所有节点“juice”处理完毕后,开始处理节点“咖啡:3”。

欢迎任何建议或可用的源代码。

提前致谢

I have the input file contains large amount of transactions like

Transaction ID Items

T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee, 
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice

i want the occurrence of every unique item like

Item Name Count
Bread  9
Milk   8
Coffee 7
Juice  6

alt text http://www.ade-technologies.com/OrderFP_Tree.jpg

and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows

The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “layer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.

In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”.

Any suggestions or available source code, welcome.

thanks in advance

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

饮湿 2024-09-06 20:10:50

这可以直接在 SQL 中完成

CREATE TABLE dbo.TestTable
( FIELD1 VARCHAR(256) )
GO

INSERT INTO dbo.TestTable(FIELD1) VALUES
('T1 Bread, milk, coffee, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T2 Juice, milk, coffee')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T3 Bread, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T4 Coffee, milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T5 Bread, Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T6 Coffee, Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T7 Coffee, Bread, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T8 Bread, Milk, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T9 Milk, Bread, Coffee,')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T10 Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T11 Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T12 Milk, Coffee, Bread, Juice')
GO

--CREATE INDEX TestIndex ON dbo.TestTable(FIELD1)
--GO

;WITH Numbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
    FROM dbo.TestTable T1
    CROSS JOIN dbo.TestTable T2
),
Base AS
(
    SELECT SUBSTRING(FIELD1, 0, CHARINDEX(' ', FIELD1, 0)) AS TRANID,
    UPPER(REPLACE(SUBSTRING(FIELD1, CHARINDEX(' ', FIELD1, 0)+1, DATALENGTH(FIELD1)), ' ', '')) AS ITEMS
    FROM dbo.TestTable
),
Split AS
(
    SELECT TRANID, ITEMS, N, SUBSTRING(ITEMS, N, CHARINDEX(',', ITEMS + ',', N) - N) AS ELEMENT
    FROM Base 
    JOIN Numbers ON N <= DATALENGTH(Base.ITEMS) + 1
    AND SUBSTRING(',' + Base.ITEMS, N, 1) = ','
)
SELECT ELEMENT, COUNT(*) AS TOTAL
FROM Split
GROUP BY ELEMENT
ORDER BY TOTAL DESC

这将返回

BREAD   9
MILK    8
COFFEE  7
JUICE   6
        1  

空条目来自事务 T9 末尾的逗号

T9 Milk, Bread, Coffee, 

This can be done directly in SQL

CREATE TABLE dbo.TestTable
( FIELD1 VARCHAR(256) )
GO

INSERT INTO dbo.TestTable(FIELD1) VALUES
('T1 Bread, milk, coffee, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T2 Juice, milk, coffee')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T3 Bread, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T4 Coffee, milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T5 Bread, Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T6 Coffee, Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T7 Coffee, Bread, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T8 Bread, Milk, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T9 Milk, Bread, Coffee,')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T10 Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T11 Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T12 Milk, Coffee, Bread, Juice')
GO

--CREATE INDEX TestIndex ON dbo.TestTable(FIELD1)
--GO

;WITH Numbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
    FROM dbo.TestTable T1
    CROSS JOIN dbo.TestTable T2
),
Base AS
(
    SELECT SUBSTRING(FIELD1, 0, CHARINDEX(' ', FIELD1, 0)) AS TRANID,
    UPPER(REPLACE(SUBSTRING(FIELD1, CHARINDEX(' ', FIELD1, 0)+1, DATALENGTH(FIELD1)), ' ', '')) AS ITEMS
    FROM dbo.TestTable
),
Split AS
(
    SELECT TRANID, ITEMS, N, SUBSTRING(ITEMS, N, CHARINDEX(',', ITEMS + ',', N) - N) AS ELEMENT
    FROM Base 
    JOIN Numbers ON N <= DATALENGTH(Base.ITEMS) + 1
    AND SUBSTRING(',' + Base.ITEMS, N, 1) = ','
)
SELECT ELEMENT, COUNT(*) AS TOTAL
FROM Split
GROUP BY ELEMENT
ORDER BY TOTAL DESC

This returns

BREAD   9
MILK    8
COFFEE  7
JUICE   6
        1  

The empty entry comes from the comma at the end of transaction T9

T9 Milk, Bread, Coffee, 
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文