Creating an adjacency matrix from sparse SKU data in Python
I have e-commerce data with about 6,000 SKUs and 250,000 observations. A simplified version is below, though the real data is much sparser. There is only one SKU per line, as each line is a transaction.
What I have:
|Index| ID | SKU1| SKU2 | SKU3|
|:----|:----|:----|:-----|:----|
| 1 | 55 | 1 | 0 | 0 |
| 2 | 55 | 0 | 1 | 0 |
| 3 | 55 | 0 | 0 | 1 |
| 4 | 66 | 0 | 1 | 0 |
| 5 | 66 | 1 | 0 | 0 |
| 6 | 77 | 0 | 1 | 0 |
I want to create a weighted undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the table below, where SKU2 and SKU1 were bought together in baskets 55 and 66 and therefore have a total weight of 2.
What I want:
|Index| SKU1| SKU2| SKU3 | SKU4|
|:----|:----|:----|:-----|:----|
| SKU1| 0 | 2 | 1 | 0 |
| SKU2| 2 | 0 | 0 | 0 |
| SKU3| 1 | 0 | 0 | 0 |
| SKU4| 0 | 0 | 0 | 0 |
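Put another way: if `B` is the basket-by-SKU 0/1 matrix (one row per basket), the weight for a pair of SKUs is the number of baskets that contain both, i.e. the off-diagonal entries of `B.T @ B`. A quick numpy illustration on the sample rows above (a toy check only; note that with this sample, SKU2 and SKU3 also share basket 55, so that pair comes out with weight 1):

```python
import numpy as np

# One 0/1 row per basket (55, 66, 77) over columns SKU1..SKU3,
# i.e. the "What I have" rows collapsed by ID.
B = np.array([[1, 1, 1],   # basket 55
              [1, 1, 0],   # basket 66
              [0, 1, 0]])  # basket 77

W = B.T @ B                # W[i, j] = number of baskets containing both SKU i and SKU j
np.fill_diagonal(W, 0)     # an item co-occurring with itself is not an edge
print(W)
# [[0 2 1]
#  [2 0 1]
#  [1 1 0]]
```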
I have tried a for loop iterating through the original DF but it crashes immediately.
Ideally I would collapse the first dataframe by the ID column without really aggregating anything, since there are no duplicate transactions for the same item and the same ID. However, when I try to collapse using df.groupby(['ID']).count() I get the following, and when I remove .count() there is no visible output at all. I'm sure there is another way to do this, but I can't seem to find it in the documentation.
What I tried: df.groupby(['ID']).count()
| ID | SKU1| SKU2 | SKU3|
|:----|:----|:---- |:----|
| 55 | 3 | 3 | 3 |
| 66 | 2 | 2 | 2 |
| 77 | 1 | 1 | 1 |
Anyone know how I can generate the sparse matrix without immediately crashing my computer?
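For the scale concern, here is a minimal sketch of one memory-friendly route using `scipy.sparse`. It assumes the wide 0/1 table above already fits in memory as a DataFrame called `df`, with an `ID` column plus one dummy column per SKU; all names here are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed input: `df` is the wide 0/1 transaction table described above.
sku_cols = [c for c in df.columns if c != 'ID']

basket_idx, basket_ids = pd.factorize(df['ID'])    # basket code for every transaction row
rows, cols = np.nonzero(df[sku_cols].to_numpy())   # positions of the 1s

# Sparse basket-by-SKU incidence matrix (duplicate entries are summed, then clipped to 1).
B = sparse.csr_matrix(
    (np.ones(len(rows), dtype=np.int64), (basket_idx[rows], cols)),
    shape=(len(basket_ids), len(sku_cols)),
)
B.data[:] = 1

# Weighted adjacency: SKU-by-SKU co-purchase counts with a zeroed diagonal.
A = (B.T @ B).tolil()
A.setdiag(0)

# Keep A sparse for graph libraries, or densify if a 6000 x 6000 frame is acceptable:
adjacency = pd.DataFrame(A.toarray(), index=sku_cols, columns=sku_cols)
```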
1 Answer:
.count() counts every non-null cell, zeros included, which is why each SKU column just shows the number of rows per basket. Aggregate with .sum() instead and then convert the result back to 0s and 1s.
Turn that into a co-occurrence table and fill the diagonal with 0s, since an item paired with itself is not a useful edge.
Turn it back into a DataFrame.
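A minimal sketch of those steps in pandas, assuming `df` is the wide transaction table from the question (an `ID` column plus one 0/1 dummy column per SKU):

```python
import numpy as np
import pandas as pd

# Collapse to one 0/1 row per basket: sum the dummies per ID, then clip back to 0/1.
baskets = (df.groupby('ID').sum() > 0).astype(int)

# Co-occurrence table: SKU-by-SKU co-purchase counts.
mat = baskets.to_numpy()
co = mat.T @ mat
np.fill_diagonal(co, 0)   # fill the diagonal with 0s

# Back to a labelled DataFrame, ready for graph analysis.
adjacency = pd.DataFrame(co, index=baskets.columns, columns=baskets.columns)
```

The final 6,000 x 6,000 matrix is about 36 million cells, which is usually fine to keep as a dense, SKU-labelled DataFrame.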