Optimal conditional join of pandas DataFrames

Published 2025-01-25 22:31:05 · 9,499 characters · 4 views · 0 comments


I have a situation where I am trying to join df_a to df_b

In reality, these dataframes have shapes: (389944, 121) and (1098118, 60)

I need to conditionally join these two dataframes if any of the below conditions are true. If multiple, it only needs to be joined once:

df_a.player == df_b.handle 
df_a.website == df_b.url 
df_a.website == df_b.web_addr
df_a.website == df_b.notes

For an example...

df_a:
player website merch
michael jordan www.michaeljordan.com Y
Lebron James www.kingjames.com Y
Kobe Bryant www.mamba.com Y
Larry Bird www.larrybird.com Y
luka Doncic www.77.com N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                NaN                          luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com            nh           0          0
Twitter   NaN                              https://www.michaeljordansworld.com/      www.michaeljordan.com        nh           0          0
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com            NaN                                       NaN                          nh           0          0

Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields

player website merch handle followers following
michael jordan www.michaeljordan.com Y nh 0 0
Lebron James www.kingjames.com Y null null null
Kobe Bryant www.mamba.com Y Kobe Bryant 14900000 514
Larry Bird www.larrybird.com Y nh 0 0
luka Doncic www.77.com N luka7doncic 1500000 347

A minimal, reproducible example is below:

import pandas as pd, numpy as np

df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan',  1: 'Lebron James',  2: 'Kobe Bryant',  3: 'Larry Bird',  4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com',  1: 'www.kingjames.com',  2: 'www.mamba.com',  3: 'www.larrybird.com',  4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter',  1: 'Twitter',  2: 'Twitter',  3: 'Twitter',  4: 'Twitter',  5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic',  1: 'www.larrybird.com',  2: np.nan,  3: 'https://twitter.com/kobebryant',  4: 'fooman.com',  5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com',  1: 'https://en.wikipedia.org/wiki/Larry_Bird',  2: 'https://www.michaeljordansworld.com/',  3: 'https://granitystudios.com/',  4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan,  1: 'www.larrybird.com',  2: 'www.michaeljordan.com',  3: 'https://granitystudios.com/',  4: 'foobar',  5: np.nan}, 'handle': {0: 'luka7doncic',  1: 'nh',  2: 'nh',  3: 'Kobe Bryant',  4: 'foobarman',  5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})

cols_to_join = ['url', 'web_addr', 'notes']

on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')

res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
    
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)

final

However, this produces erroneous results with duplicate columns.

How can I do this more efficiently and with correct results?

2 Answers

几味少女 2025-02-01 22:31:05

Use:

# drop the lookup columns so df_a matches the original input
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)

# melt df_b so url/web_addr/notes collapse into a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# the melt can produce duplicate websites; keep one row per website,
# preferring the row with the most followers
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')

print (df2)
    followers  following       handle platform  variable  \
9    14900000        514  Kobe Bryant  Twitter  web_addr   
3    14900000        514  Kobe Bryant  Twitter       url   
6     1500000        347  luka7doncic  Twitter  web_addr   
12    1500000        347  luka7doncic  Twitter     notes   
0     1500000        347  luka7doncic  Twitter       url   
10          1          1    foobarman  Twitter  web_addr   
4           1          1    foobarman  Twitter       url   
16          1          1    foobarman  Twitter     notes   
5           0          0           nh  Twitter       url   
7           0          0           nh  Twitter  web_addr   
8           0          0           nh  Twitter  web_addr   
1           0          0           nh  Twitter       url   
14          0          0           nh  Twitter     notes   

                                     website  
9                https://granitystudios.com/  
3             https://twitter.com/kobebryant  
6                                 www.77.com  
12                                       NaN  
0            https://twitter.com/luka7doncic  
10                                thefoo.com  
4                                 fooman.com  
16                                    foobar  
5                      www.stackoverflow.com  
7   https://en.wikipedia.org/wiki/Larry_Bird  
8       https://www.michaeljordansworld.com/  
1                          www.larrybird.com  
14                     www.michaeljordan.com 
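The melt step above reshapes the three candidate columns into one; on a tiny frame with made-up values (not the question's data) it looks like this:

```python
import pandas as pd

df = pd.DataFrame({'handle': ['h1'],
                   'url': ['u.com'], 'web_addr': ['w.com'], 'notes': ['n.com']})

# columns not listed in id_vars (url, web_addr, notes) are stacked into
# variable/website pairs, one row per original column
long = df.melt(id_vars=['handle'], value_name='website')
# three rows: (url, u.com), (web_addr, w.com), (notes, n.com)
```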

# merge twice; both results share df_a's index, so missing values in the
# website-based merge can be filled from the handle-based merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')

dffin = dffin2.fillna(dffin1)
print (dffin)
           player                website merch   followers  following  \
0  michael jordan  www.michaeljordan.com     Y         0.0        0.0   
1    Lebron James      www.kingjames.com     Y         NaN        NaN   
2     Kobe Bryant          www.mamba.com     Y  14900000.0      514.0   
3      Larry Bird      www.larrybird.com     Y         0.0        0.0   
4     luka Doncic             www.77.com     N   1500000.0      347.0   

        handle  
0           nh  
1          NaN  
2  Kobe Bryant  
3           nh  
4  luka7doncic  
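The final fillna step can equivalently be written with DataFrame.combine_first, which keeps the caller's non-missing values and falls back to the other frame; a minimal sketch with made-up values (not the frames above):

```python
import pandas as pd
import numpy as np

# primary result with a gap, and a fallback aligned on the same index
primary  = pd.DataFrame({'followers': [np.nan, 5.0]}, index=[0, 1])
fallback = pd.DataFrame({'followers': [3.0, 9.0]}, index=[0, 1])

# combine_first keeps primary's values and fills its NaNs from fallback
out = primary.combine_first(fallback)
print(out['followers'].tolist())  # [3.0, 5.0]
```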

葬シ愛 2025-02-01 22:31:05

You can pass lists to left_on and right_on -

final = df_a.merge(
    right=df_b, 
    left_on=['player', 'website', 'website', 'website'], 
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)

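Note, though, that when merge receives list-valued keys it requires every key pair to match simultaneously, i.e. an AND of the conditions rather than the OR the question asks for. A toy example (hypothetical frames, not the question's data) shows the effect:

```python
import pandas as pd

left = pd.DataFrame({'player': ['luka'], 'website': ['www.77.com']})
right = pd.DataFrame({'handle': ['luka'], 'url': ['www.77.com'],
                      'web_addr': ['other.com'], 'notes': ['other.com'],
                      'followers': [347]})

# every key pair must match at once: player==handle AND website==url
# AND website==web_addr AND website==notes
out = left.merge(right,
                 left_on=['player', 'website', 'website', 'website'],
                 right_on=['handle', 'url', 'web_addr', 'notes'],
                 how='left')
# web_addr/notes differ, so no row matches and followers is NaN
```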