Creating an adjacency matrix from sparse SKU data in Python
I have e-commerce data with about 6,000 SKUs and 250,000 observations. A simplified version is below, though the real data is much sparser. There is only one SKU per line, as each line is a transaction.
What I have:
|Index| ID | SKU1| SKU2 | SKU3|
|:----|:----|:----|:-----|:----|
| 1 | 55 | 1 | 0 | 0 |
| 2 | 55 | 0 | 1 | 0 |
| 3 | 55 | 0 | 0 | 1 |
| 4 | 66 | 0 | 1 | 0 |
| 5 | 66 | 1 | 0 | 0 |
| 6 | 77 | 0 | 1 | 0 |
I want to create a weighted undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the table below, where SKU2 and SKU1 were bought together in baskets 55 and 66 and therefore have a total weight of 2.
What I want:
|Index| SKU1| SKU2| SKU3 | SKU4|
|:----|:----|:----|:-----|:----|
| SKU1| 0 | 2 | 1 | 0 |
| SKU2| 2 | 0 | 0 | 0 |
| SKU3| 1 | 0 | 0 | 0 |
| SKU4| 0 | 0 | 0 | 0 |
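Put another way: if `B` is the basket-by-SKU 0/1 matrix (one row per basket), the weight for a pair of SKUs is the number of baskets that contain both, i.e. the off-diagonal entries of `B.T @ B`. A quick numpy illustration on the sample rows above (a toy check only; note that with this sample, SKU2 and SKU3 also share basket 55, so that pair comes out with weight 1):

```python
import numpy as np

# One 0/1 row per basket (55, 66, 77) over columns SKU1..SKU3,
# i.e. the "What I have" rows collapsed by ID.
B = np.array([[1, 1, 1],   # basket 55
              [1, 1, 0],   # basket 66
              [0, 1, 0]])  # basket 77

W = B.T @ B                # W[i, j] = number of baskets containing both SKU i and SKU j
np.fill_diagonal(W, 0)     # an item co-occurring with itself is not an edge
print(W)
# [[0 2 1]
#  [2 0 1]
#  [1 1 0]]
```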
I have tried a for loop iterating through the original DF but it crashes immediately.
Ideally I would collapse the first dataframe by the ID column without really aggregating anything, since there are no duplicate transactions for the same item and the same ID. However, when I try to collapse using df.groupby(['ID']).count() I get the following, and when I remove .count() there is no visible output at all. I'm sure there is another way to do this, but I can't seem to find it in the documentation.
What I tried: df.groupby(['ID']).count()
| ID | SKU1| SKU2 | SKU3|
|:----|:----|:---- |:----|
| 55 | 3 | 3 | 3 |
| 66 | 2 | 2 | 2 |
| 77 | 1 | 1 | 1 |
Anyone know how I can generate the sparse matrix without immediately crashing my computer?
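For the scale concern, here is a minimal sketch of one memory-friendly route using `scipy.sparse`. It assumes the wide 0/1 table above already fits in memory as a DataFrame called `df`, with an `ID` column plus one dummy column per SKU; all names here are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed input: `df` is the wide 0/1 transaction table described above.
sku_cols = [c for c in df.columns if c != 'ID']

basket_idx, basket_ids = pd.factorize(df['ID'])    # basket code for every transaction row
rows, cols = np.nonzero(df[sku_cols].to_numpy())   # positions of the 1s

# Sparse basket-by-SKU incidence matrix (duplicate entries are summed, then clipped to 1).
B = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.int64), (basket_idx[rows], cols)),
    shape=(len(basket_ids), len(sku_cols)),
)
B.data[:] = 1

# Weighted adjacency: SKU-by-SKU co-purchase counts with a zeroed diagonal.
A = (B.T @ B).tolil()
A.setdiag(0)

# Keep A sparse for graph libraries, or densify if a 6000 x 6000 frame is acceptable:
adjacency = pd.DataFrame(A.toarray(), index=sku_cols, columns=sku_cols)
```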
1 Answer:
.count() counts every non-null cell, zeros included, which is why each SKU column just shows the number of rows per basket. Aggregate with .sum() instead and then convert the result back to 0s and 1s.
Turn that into a co-occurrence table and fill the diagonal with 0s, since an item paired with itself is not a useful edge.
Turn it back into a DataFrame.
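A minimal sketch of those steps in pandas, assuming `df` is the wide transaction table from the question (an `ID` column plus one 0/1 dummy column per SKU):

```python
import numpy as np
import pandas as pd

# Collapse to one 0/1 row per basket: sum the dummies per ID, then clip back to 0/1.
baskets = (df.groupby('ID').sum() > 0).astype(int)

# Co-occurrence table: SKU-by-SKU co-purchase counts.
mat = baskets.to_numpy()
co = mat.T @ mat
np.fill_diagonal(co, 0)   # fill the diagonal with 0s

# Back to a labelled DataFrame, ready for graph analysis.
adjacency = pd.DataFrame(co, index=baskets.columns, columns=baskets.columns)
```

The final 6,000 x 6,000 matrix is about 36 million cells, which is usually fine to keep as a dense, SKU-labelled DataFrame.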