How to search for dataframe subgroups that contain a specific data pair?
I have a dataset which contains orders and the items in those orders. What I want to find is which item pairs exist together in which orders. For example, I want to find the orders that have 6395477 and 6391546 together; in this case, orders 20220627-0015 and 20220627-0014 have them.
ORDER_ID ITEM_ID
20220627-0015 6395477
20220627-0015 6391546
20220627-0015 6385826
20220627-0015 6385822
20220627-0015 6385810
20220627-0015 6385797
20220627-0015 6379808
20220627-0014 6395477
20220627-0014 6391546
20220627-0014 6386347
20220627-0014 6385814
20220627-0014 6385812
20220627-0014 6385810
20220627-0014 6384898
20220627-0013 6395592
20220627-0013 6395583
20220627-0013 6394927
20220627-0013 6386403
20220627-0013 6384293
20220627-0013 6380162
20220627-0013 6379830
20220627-0013 6379828
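For reference, a minimal way to load the sample above into a DataFrame (only the first two orders are reproduced here; column names follow the table header, while the question's code refers to the columns as Order_Id / Item_Id):

import io
import pandas as pd

# first two orders from the table above, for quick experimentation
sample = """ORDER_ID,ITEM_ID
20220627-0015,6395477
20220627-0015,6391546
20220627-0015,6385826
20220627-0015,6385822
20220627-0015,6385810
20220627-0015,6385797
20220627-0015,6379808
20220627-0014,6395477
20220627-0014,6391546
20220627-0014,6386347
20220627-0014,6385814
20220627-0014,6385812
20220627-0014,6385810
20220627-0014,6384898
"""
df = pd.read_csv(io.StringIO(sample), dtype={'ORDER_ID': str})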
I created the item pairs from the items list using the code:

from itertools import combinations

# build every 2-item combination within each order's item list
# (note: .iteritems() was removed in pandas 2.0; use .items() there)
x = [tuple(pair) for i, order_items in items.iteritems() for pair in combinations(order_items, 2)]
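Here, items is presumably a per-order collection of item ids; a hypothetical construction, using the column names from the question's code:

# hypothetical: one entry per order, holding that order's list of item ids
items = df.groupby('Order_Id')['Item_Id'].apply(list)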
and then I found the orders that contain the 1900th element of x (a single item pair) with the code:

# count the distinct items of that pair present in each candidate order
orders = df.loc[df['Item_Id'].isin(x[1900])].groupby('Order_Id', as_index=False).nunique()
# keep only the orders that contain both items of the pair
order_list = orders.loc[orders['Item_Id'] == len(x[1900]), 'Order_Id'].to_list()
and I got the result I wanted:
[91158728,91158755,91158756,'20220627-0015']
I created a function to easily loop over the item pairs, and tested it:

def ordfunc(df1, ls1, i):
    # orders containing at least one item of pair ls1[i], with counts of unique items
    orders = df1.loc[df1['Item_Id'].isin(ls1[i])].groupby('Order_Id', as_index=False).nunique()
    # keep only the orders that contain every item of the pair
    order_list = orders.loc[orders['Item_Id'] == len(ls1[i]), 'Order_Id'].to_list()
    return order_list
I tried to get the result with the loop below, but the kernel couldn't handle it:

final_list = [list(ordfunc(df, x, i)) for i in range(len(x))]

How can I create a loop that iterates over roughly 1.7 million item pairs and gives, for each pair, the list of orders that contain that pair together?
3 Answers
To do this, I created a list of the items you want to check for. Then I filtered the dataframe using .loc with .isin() to keep only the rows that contain one of those two products. You can then group this up on ORDER_ID and use nunique() to get the count of unique ITEM_IDs per order. The ORDER_IDs that contain all of the checked items are then the rows where that count equals len(item_ids), which can be done as below to return a list of ORDER_IDs:
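A minimal sketch of that approach, assuming a DataFrame df with ORDER_ID and ITEM_ID columns (as in the sample above) and a pair of items to check:

# the two items to check for; any pair from the data works here
item_ids = [6395477, 6391546]

# keep only the rows whose ITEM_ID is one of the checked items
matches = df.loc[df['ITEM_ID'].isin(item_ids)]

# count the distinct checked items present in each order
counts = matches.groupby('ORDER_ID', as_index=False).nunique()

# orders whose count equals the number of checked items contain them all
order_list = counts.loc[counts['ITEM_ID'] == len(item_ids), 'ORDER_ID'].to_list()
print(order_list)  # e.g. ['20220627-0014', '20220627-0015'] with the sample data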
Assuming there is only one same item per order, you can also use df.where, as in the sketch below:
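A rough sketch of the df.where idea, under the same column-name assumptions; the count() at the end relies on the stated one-same-item-per-order assumption:

# hypothetical pair to check
pair = (6395477, 6391546)

# blank out items that are not part of the pair, then count the remaining
# (non-NaN) values per order; with at most one occurrence of an item per
# order, a count equal to len(pair) means both items are present
hits = (
    df['ITEM_ID']
    .where(df['ITEM_ID'].isin(pair))
    .groupby(df['ORDER_ID'])
    .count()
)
order_list = hits[hits == len(pair)].index.tolist()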
A solution that does not test consecutive orders:
If you need all pairs in a separate DataFrame (one way is sketched below):
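A sketch of one way to get every within-order pair, each with the list of orders that contain it; df, ORDER_ID and ITEM_ID are assumed as above, and the PAIR column name is only illustrative:

from itertools import combinations

# list each order's items, expand to every 2-item combination,
# and keep one row per (order, pair)
pairs = (
    df.groupby('ORDER_ID')['ITEM_ID']
      .apply(lambda s: list(combinations(sorted(s), 2)))  # sort so (a, b) == (b, a)
      .explode()
      .dropna()                                           # single-item orders yield no pair
      .reset_index(name='PAIR')
)

# one row per pair, with the list of orders that contain that pair together
result = pairs.groupby('PAIR')['ORDER_ID'].agg(list).reset_index()

Because the pairs are generated from the data itself and grouped in a single pass, this avoids looking up each of the ~1.7 million candidate pairs one at a time.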