How to search DataFrame subgroups that contain a specific pair of values?

Posted 2025-02-11 14:35:42

I have a dataset that contains orders and the items in those orders. I want to find which item pairs occur together in which orders. For example, I want to find the orders that contain both 6395477 and 6391546; in this case, orders 20220627-0015 and 20220627-0014 contain them.

ORDER_ID     ITEM_ID
20220627-0015            6395477
20220627-0015            6391546
20220627-0015            6385826
20220627-0015            6385822
20220627-0015            6385810
20220627-0015            6385797
20220627-0015            6379808
20220627-0014            6395477
20220627-0014            6391546
20220627-0014            6386347
20220627-0014            6385814
20220627-0014            6385812
20220627-0014            6385810
20220627-0014            6384898
20220627-0013            6395592
20220627-0013            6395583
20220627-0013            6394927
20220627-0013            6386403
20220627-0013            6384293
20220627-0013            6380162
20220627-0013            6379830
20220627-0013            6379828

I created the item pairs from the items list using the code:

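from itertools import combinations
x = [tuple(pair) for _, order_items in items.items() for pair in combinations(order_items, 2)]  # .items() replaces .iteritems(), which was removed in pandas 2.0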

Then I found the orders that contain both items of the pair at index 1900 of x with this code:

orders=df.loc[df['Item_Id'].isin(x[1900])].groupby('Order_Id', as_index=False).nunique()

order_list = orders.loc[orders['Item_Id'] == len(x[1900]), 'Order_Id'].to_list()

and I got the result I wanted:

[91158728,91158755,91158756,'20220627-0015']

I created a function to easily loop over the item pairs and tested it:

def ordfunc(df1,ls1,i):
    orders = df1.loc[df1['Item_Id'].isin(ls1[i])].groupby('Order_Id', as_index=False).nunique()
    order_list=orders.loc[orders['Item_Id'] == len(ls1[i]), 'Order_Id'].to_list()
    return order_list

I tried to get the full result with the loop below, but the kernel could not handle it:

final_list = [list(ordfunc(df,x,i)) for i in range(len(x)) ]

How can I create a loop that iterates over roughly 1.7 million item pairs and returns, for each pair, the list of orders that contain both items?

诗酒趁年少 2025-02-18 14:35:42

To do this, I created a list of the items you want to check for:

ITEM_IDS = [6395477, 6391546]

Then I filtered the DataFrame using .loc with .isin() to keep only the rows that contain one of those two items:

df.loc[df['ITEM_ID'].isin(ITEM_IDS)]

        ORDER_ID  ITEM_ID
0  20220627-0015  6395477
1  20220627-0015  6391546
7  20220627-0014  6395477
8  20220627-0014  6391546

You can then group this up on ORDER_ID, and use nunique() to give the count of the number of unique ITEM_IDs:

orders = df.loc[df['ITEM_ID'].isin(ITEM_IDS)].groupby('ORDER_ID', as_index=False).nunique()

        ORDER_ID  ITEM_ID
0  20220627-0014        2
1  20220627-0015        2

The orders that contain all of the checked items are the rows where the ITEM_ID count equals len(ITEM_IDS). Selecting those rows returns the list of ORDER_IDs:

orders.loc[orders['ITEM_ID'] == len(ITEM_IDS), 'ORDER_ID'].to_list()

['20220627-0014', '20220627-0015']
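
The steps above can be wrapped into one reusable function. This is only a minimal sketch: the function name `orders_containing_all`, its parameters, and the small `sample` frame are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def orders_containing_all(df, item_ids, order_col="ORDER_ID", item_col="ITEM_ID"):
    """Return the orders that contain every item in item_ids (hypothetical helper)."""
    wanted = list(set(item_ids))
    # Keep only rows for the wanted items, then count distinct items per order.
    counts = df.loc[df[item_col].isin(wanted)].groupby(order_col)[item_col].nunique()
    # An order qualifies when every wanted item is present in it.
    return counts[counts == len(wanted)].index.tolist()

# Illustrative data, not taken from the question.
sample = pd.DataFrame({
    "ORDER_ID": ["A", "A", "A", "B", "B", "C"],
    "ITEM_ID": [1, 2, 3, 1, 3, 2],
})
result = orders_containing_all(sample, [1, 2])
```

Because the check compares against the number of distinct wanted items, the same function should also work for triples or longer item lists, not only pairs.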

怼怹恏 2025-02-18 14:35:42

Assuming each item appears at most once per order, you can use:

pair = (6395477, 6391546)
orders = df[df['ITEM_ID'].isin(pair)].value_counts('ORDER_ID').loc[lambda x: x == 2].index
print(orders)

# Output
Index(['20220627-0014', '20220627-0015'], dtype='object', name='ORDER_ID')

You can also use df.where:

orders = (df.where(df['ITEM_ID'].isin(pair)).groupby('ORDER_ID')['ITEM_ID']
            .nunique().loc[lambda x: x == 2].index)
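
To see why the "at most once per order" assumption matters, here is a small sketch on made-up data (the `dup` frame and the variable names are illustrative only). It compares counting rows per order, which is what the value_counts one-liner does, with counting distinct items per order.

```python
import pandas as pd

pair = (1, 2)
# Order B contains item 1 twice and never contains item 2.
dup = pd.DataFrame({
    "ORDER_ID": ["A", "A", "B", "B", "C", "C"],
    "ITEM_ID": [1, 2, 1, 1, 1, 3],
})
filtered = dup[dup["ITEM_ID"].isin(pair)]

# Counting rows: the duplicated item makes order B look like a match.
row_counts = filtered["ORDER_ID"].value_counts()
by_rows = sorted(row_counts[row_counts == 2].index.tolist())

# Counting distinct items: only order A really holds both items.
unique_counts = filtered.groupby("ORDER_ID")["ITEM_ID"].nunique()
by_unique = unique_counts[unique_counts == 2].index.tolist()
```

If an item can repeat within an order, the nunique-based variant is the safer of the two.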

清风不识月 2025-02-18 14:35:42

A solution that does not require the matching rows to be consecutive:

v = (6395477, 6391546)
df1 = df[df['ITEM_ID'].isin(v)]
print (df1)
        ORDER_ID  ITEM_ID
0  20220627-0015  6395477
1  20220627-0015  6391546
7  20220627-0014  6395477
8  20220627-0014  6391546

orders = df1.loc[df1['ORDER_ID'].duplicated(keep=False), 'ORDER_ID'].unique().tolist()
print (orders)
['20220627-0015', '20220627-0014']

If you need all pairs in a separate DataFrame:

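from itertools import combinations
import pandas as pd

# One (order, pair) row for every pair of items that share an order.
L = [(i, tuple(y)) for i, x in df.groupby('ORDER_ID')['ITEM_ID'] for y in combinations(x, 2)]

# A separate name keeps the original df intact.
pairs_df = pd.DataFrame(L, columns=['a','b'])
print(pairs_df)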
                a                   b
0   20220627-0013  (6395592, 6395583)
1   20220627-0013  (6395592, 6394927)
2   20220627-0013  (6395592, 6386403)
3   20220627-0013  (6395592, 6384293)
4   20220627-0013  (6395592, 6380162)
..            ...                 ...
65  20220627-0015  (6385822, 6385797)
66  20220627-0015  (6385822, 6379808)
67  20220627-0015  (6385810, 6385797)
68  20220627-0015  (6385810, 6379808)
69  20220627-0015  (6385797, 6379808)

[70 rows x 2 columns]
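
For the roughly 1.7 million pairs in the question, one option is to avoid filtering the DataFrame once per pair and instead make a single pass over the orders, collecting the orders for every pair as it appears. The sketch below is one possible way to do that, not the answer's own code; the function name `orders_by_pair` and the `sample` data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

def orders_by_pair(df, order_col="ORDER_ID", item_col="ITEM_ID"):
    """Map each item pair to the list of orders containing both items (hypothetical helper)."""
    pair_orders = defaultdict(list)
    for order_id, order_items in df.groupby(order_col)[item_col]:
        # De-duplicate and sort so (a, b) and (b, a) share one key.
        unique_items = sorted(set(order_items.tolist()))
        for pair in combinations(unique_items, 2):
            pair_orders[pair].append(order_id)
    return dict(pair_orders)

# Illustrative data, not taken from the question.
sample = pd.DataFrame({
    "ORDER_ID": ["A", "A", "A", "B", "B", "C"],
    "ITEM_ID": [1, 2, 3, 1, 3, 2],
})
mapping = orders_by_pair(sample)
```

Looking up a pair is then a dictionary access, for example `mapping[(1, 3)]`, with the pair given in ascending item order. The work grows with the number of pairs inside each order rather than with the number of pairs times the number of rows.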
