pandas: keep groupby entries even when there are no common values

Posted 2025-02-10 18:56:30

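I have 3 dataframes. I merge df1 and df2 on a common column. However, I need to use df3 to find which values are allowed for the pairs seen in the resulting groupby. I can get this part done too, using a two-column inner merge, but I also need to see the entries that have no common elements. What I have so far is represented by this model problem: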

import pandas as pd

ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)

hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)

cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)

# Attach every menu item of a hotel to each (country, hotel) pair
chm_df = pd.merge(ch_df, hm_df, on='hotel')

# Keep only the items allowed in the country, then collect them per (country, hotel) pair
pd.merge(left=chm_df, right=cm_df, on=['country','menu'], how='inner').groupby(['country','hotel'])['menu'].apply(list).reset_index(name='menu items')
  country     hotel          menu items
0   India    Oberoi        [ildi, soup]
1   India       Taj  [ildi, dosa, soup]
2   Italy  Marriott      [pasta, pizza]
3     USA    Hilton     [pizza, burger]
4     USA       Taj              [dosa]

What I need are entries such as:

5   Italy  Oberoi                    []
...

One inefficient way is to add an allowed menu item to each pair in hm_df and remove it after the groupby, but that looks ugly. What is a more elegant method?

Comments (2)

少年亿悲伤 2025-02-17 18:56:30

If all possible combinations are needed, use DataFrame.unstack with DataFrame.stack; to replace non-existent values with empty lists, pass the fill_value=[] parameter:

df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).unstack(fill_value=[]).stack().reset_index(name='menu items')
    
print (df)
   country     hotel          menu items
0    India    Hilton                  []
1    India  Marriott                  []
2    India    Oberoi        [ildi, soup]
3    India       Taj  [ildi, dosa, soup]
4    Italy    Hilton                  []
5    Italy  Marriott      [pasta, pizza]
6    Italy    Oberoi                  []
7    Italy       Taj                  []
8      USA    Hilton     [pizza, burger]
9      USA  Marriott                  []
10     USA    Oberoi                  []
11     USA       Taj              [dosa]
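The same "all combinations" result can also be sketched more explicitly, without relying on fill_value accepting a list. This is only an illustration under the question's sample data: the names grouped, full_index and all_pairs are introduced here and are not part of the answer above.

```python
import pandas as pd

ch_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'Italy', 'Italy'],
    'hotel': ['Taj', 'Oberoi', 'Hilton', 'Taj', 'Hilton', 'Oberoi', 'Marriott'],
})
hm_df = pd.DataFrame({
    'hotel': ['Taj', 'Taj', 'Taj', 'Oberoi', 'Oberoi', 'Marriott',
              'Marriott', 'Marriott', 'Hilton', 'Hilton'],
    'menu': ['ildi', 'dosa', 'soup', 'soup', 'ildi', 'soup',
             'pasta', 'pizza', 'pizza', 'burger'],
})
cm_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'Italy', 'Italy'],
    'menu': ['ildi', 'dosa', 'soup', 'dosa', 'burger', 'pizza', 'pizza', 'pasta'],
})

# Menu items that are both served by the hotel and allowed in the country
chm_df = ch_df.merge(hm_df, on='hotel')
grouped = (
    chm_df.merge(cm_df, on=['country', 'menu'])
          .groupby(['country', 'hotel'])['menu']
          .apply(list)
)

# Every (country, hotel) combination, whether or not it occurs in ch_df
full_index = pd.MultiIndex.from_product(
    [sorted(ch_df['country'].unique()), sorted(ch_df['hotel'].unique())],
    names=['country', 'hotel'],
)

# reindex leaves NaN for missing pairs; swap each NaN for its own empty list
all_pairs = grouped.reindex(full_index)
all_pairs = all_pairs.apply(lambda v: v if isinstance(v, list) else [])
all_pairs = all_pairs.reset_index(name='menu items')

print(all_pairs)
```

Filling the gaps after the reindex gives every missing pair a separate list object, which may matter if the lists are modified later.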

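For completeness, if only the pairs that exist in chm_df but are missing after the merge need to become empty lists, reindex by the pairs in ch_df: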

df = pd.merge(chm_df, cm_df, on=['country','menu']).groupby(['country','hotel'])['menu'].apply(list).reindex(pd.MultiIndex.from_frame(ch_df), fill_value=[]).reset_index(name='menu items')
    
print (df)
  country     hotel          menu items
0   India       Taj  [ildi, dosa, soup]
1   India    Oberoi        [ildi, soup]
2   India    Hilton                  []
3     USA       Taj              [dosa]
4     USA    Hilton     [pizza, burger]
5   Italy    Oberoi                  []
6   Italy  Marriott      [pasta, pizza]
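A possible alternative to the reindex is a left merge back onto ch_df, which keeps the original pairs and their order. This is a sketch under the question's sample data; matched and result are names introduced here for illustration.

```python
import pandas as pd

ch_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'Italy', 'Italy'],
    'hotel': ['Taj', 'Oberoi', 'Hilton', 'Taj', 'Hilton', 'Oberoi', 'Marriott'],
})
hm_df = pd.DataFrame({
    'hotel': ['Taj', 'Taj', 'Taj', 'Oberoi', 'Oberoi', 'Marriott',
              'Marriott', 'Marriott', 'Hilton', 'Hilton'],
    'menu': ['ildi', 'dosa', 'soup', 'soup', 'ildi', 'soup',
             'pasta', 'pizza', 'pizza', 'burger'],
})
cm_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'Italy', 'Italy'],
    'menu': ['ildi', 'dosa', 'soup', 'dosa', 'burger', 'pizza', 'pizza', 'pasta'],
})

# Pairs that have at least one common menu item, with the items as a list
matched = (
    ch_df.merge(hm_df, on='hotel')
         .merge(cm_df, on=['country', 'menu'])
         .groupby(['country', 'hotel'])['menu']
         .apply(list)
         .reset_index(name='menu items')
)

# Left merge keeps every pair from ch_df; unmatched pairs get NaN
result = ch_df.merge(matched, on=['country', 'hotel'], how='left')

# Replace each NaN with its own empty list
result['menu items'] = result['menu items'].apply(
    lambda v: v if isinstance(v, list) else []
)

print(result)
```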

烟燃烟灭 2025-02-17 18:56:30

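The tables I was working with were too large (~100M rows), so the approach I initially sought does not work. I have now come up with what is, hopefully, a better approach.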

import pandas as pd

ch = {'country':['India','India','India','USA','USA','Italy','Italy'],'hotel':['Taj','Oberoi','Hilton','Taj','Hilton','Oberoi','Marriott']}
ch_df = pd.DataFrame.from_dict(ch)

hm = {'hotel':['Taj','Taj','Taj','Oberoi','Oberoi','Marriott','Marriott','Marriott','Hilton','Hilton'],'menu':['ildi','dosa','soup','soup','ildi','soup','pasta','pizza','pizza','burger']}
hm_df = pd.DataFrame.from_dict(hm)

cm = {'country':['India','India','India','USA','USA','USA','Italy','Italy'],'menu':['ildi','dosa','soup','dosa','burger','pizza','pizza','pasta']}
cm_df = pd.DataFrame.from_dict(cm)

c_dict = cm_df.groupby('country')['menu'].apply(set).to_dict()
h_dict = hm_df.groupby('hotel')['menu'].apply(set).to_dict()

# Direct indexing raises KeyError when a country or hotel is missing from the dictionaries, as happened in my case:
# ch_df['new_col'] = ch_df.apply(lambda x: '|'.join(c_dict[x.country].intersection(h_dict[x.hotel])), axis=1)

def menu_list(a, b):
    # Join the menu items allowed in country a and served by hotel b; empty string if either key is missing
    if a in c_dict and b in h_dict:
        return '|'.join(c_dict[a].intersection(h_dict[b]))
    return ''

ch_df['menu_list'] = ch_df.apply(lambda x: menu_list(x.country, x.hotel), axis=1)

ch_df
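A variant of the same set-intersection idea, sketched here as an assumption rather than as part of the answer: dict.get with an empty-set default covers missing keys, sorted makes the joined string deterministic, and iterating over zip avoids the per-row overhead of apply(axis=1), which may help on very large tables.

```python
import pandas as pd

ch_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'Italy', 'Italy'],
    'hotel': ['Taj', 'Oberoi', 'Hilton', 'Taj', 'Hilton', 'Oberoi', 'Marriott'],
})
hm_df = pd.DataFrame({
    'hotel': ['Taj', 'Taj', 'Taj', 'Oberoi', 'Oberoi', 'Marriott',
              'Marriott', 'Marriott', 'Hilton', 'Hilton'],
    'menu': ['ildi', 'dosa', 'soup', 'soup', 'ildi', 'soup',
             'pasta', 'pizza', 'pizza', 'burger'],
})
cm_df = pd.DataFrame({
    'country': ['India', 'India', 'India', 'USA', 'USA', 'USA', 'Italy', 'Italy'],
    'menu': ['ildi', 'dosa', 'soup', 'dosa', 'burger', 'pizza', 'pizza', 'pasta'],
})

# One set of menu items per country and per hotel
c_dict = cm_df.groupby('country')['menu'].apply(set).to_dict()
h_dict = hm_df.groupby('hotel')['menu'].apply(set).to_dict()

def menu_list(country, hotel):
    # Missing keys fall back to an empty set, so the intersection is empty
    common = c_dict.get(country, set()) & h_dict.get(hotel, set())
    # Sort so the joined string does not depend on set iteration order
    return '|'.join(sorted(common))

# Plain Python loop over the two columns instead of a row-wise apply
ch_df['menu_list'] = [
    menu_list(c, h) for c, h in zip(ch_df['country'], ch_df['hotel'])
]

print(ch_df)
```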
