Optimal conditional join of pandas DataFrames

Published 2025-01-25 22:31:05 · 9,499 characters · 4 views · 0 comments


I have a situation where I am trying to join df_a to df_b

In reality, these dataframes have shapes: (389944, 121) and (1098118, 60)

I need to conditionally join these two dataframes if any of the below conditions are true. If multiple, it only needs to be joined once:

df_a.player == df_b.handle 
df_a.website == df_b.url 
df_a.website == df_b.web_addr
df_a.website == df_b.notes

For an example...

df_a:
player website merch
michael jordan www.michaeljordan.com Y
Lebron James www.kingjames.com Y
Kobe Bryant www.mamba.com Y
Larry Bird www.larrybird.com Y
luka Doncic www.77.com N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                NaN                          luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com            nh           0          0
Twitter   NaN                              https://www.michaeljordansworld.com/      www.michaeljordan.com        nh           0          0
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com            NaN                                       NaN                          nh           0          0

Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields

player website merch handle followers following
michael jordan www.michaeljordan.com Y nh 0 0
Lebron James www.kingjames.com Y null null null
Kobe Bryant www.mamba.com Y Kobe Bryant 14900000 514
Larry Bird www.larrybird.com Y nh 0 0
luka Doncic www.77.com N luka7doncic 1500000 347

A minimal, reproducible example is below:

import pandas as pd, numpy as np

df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan',  1: 'Lebron James',  2: 'Kobe Bryant',  3: 'Larry Bird',  4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com',  1: 'www.kingjames.com',  2: 'www.mamba.com',  3: 'www.larrybird.com',  4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter',  1: 'Twitter',  2: 'Twitter',  3: 'Twitter',  4: 'Twitter',  5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic',  1: 'www.larrybird.com',  2: np.nan,  3: 'https://twitter.com/kobebryant',  4: 'fooman.com',  5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com',  1: 'https://en.wikipedia.org/wiki/Larry_Bird',  2: 'https://www.michaeljordansworld.com/',  3: 'https://granitystudios.com/',  4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan,  1: 'www.larrybird.com',  2: 'www.michaeljordan.com',  3: 'https://granitystudios.com/',  4: 'foobar',  5: np.nan}, 'handle': {0: 'luka7doncic',  1: 'nh',  2: 'nh',  3: 'Kobe Bryant',  4: 'foobarman',  5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})

cols_to_join = ['url', 'web_addr', 'notes']

on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')

res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
    
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)

final

However, this produces erroneous results with duplicate columns.

How can I do this more efficiently and with correct results?

2 Answers

几味少女 2025-02-01 22:31:05

Use:

# drop the lookup columns so df_a matches the original input
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)

# melt df_b so url/web_addr/notes collapse into a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# the melt can produce duplicate websites; keep one row per website,
# preferring the row with the most followers
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')

print (df2)
    followers  following       handle platform  variable  \
9    14900000        514  Kobe Bryant  Twitter  web_addr   
3    14900000        514  Kobe Bryant  Twitter       url   
6     1500000        347  luka7doncic  Twitter  web_addr   
12    1500000        347  luka7doncic  Twitter     notes   
0     1500000        347  luka7doncic  Twitter       url   
10          1          1    foobarman  Twitter  web_addr   
4           1          1    foobarman  Twitter       url   
16          1          1    foobarman  Twitter     notes   
5           0          0           nh  Twitter       url   
7           0          0           nh  Twitter  web_addr   
8           0          0           nh  Twitter  web_addr   
1           0          0           nh  Twitter       url   
14          0          0           nh  Twitter     notes   

                                     website  
9                https://granitystudios.com/  
3             https://twitter.com/kobebryant  
6                                 www.77.com  
12                                       NaN  
0            https://twitter.com/luka7doncic  
10                                thefoo.com  
4                                 fooman.com  
16                                    foobar  
5                      www.stackoverflow.com  
7   https://en.wikipedia.org/wiki/Larry_Bird  
8       https://www.michaeljordansworld.com/  
1                          www.larrybird.com  
14                     www.michaeljordan.com 
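The melt step above reshapes the three candidate columns into one; on a tiny frame with made-up values (not the question's data) it looks like this:

```python
import pandas as pd

df = pd.DataFrame({'handle': ['h1'],
                   'url': ['u.com'], 'web_addr': ['w.com'], 'notes': ['n.com']})

# columns not listed in id_vars (url, web_addr, notes) are stacked into
# variable/website pairs, one row per original column
long = df.melt(id_vars=['handle'], value_name='website')
# three rows: (url, u.com), (web_addr, w.com), (notes, n.com)
```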

# merge twice; both results share df_a's index, so missing values in the
# website-based merge can be filled from the handle-based merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')

dffin = dffin2.fillna(dffin1)
print (dffin)
           player                website merch   followers  following  \
0  michael jordan  www.michaeljordan.com     Y         0.0        0.0   
1    Lebron James      www.kingjames.com     Y         NaN        NaN   
2     Kobe Bryant          www.mamba.com     Y  14900000.0      514.0   
3      Larry Bird      www.larrybird.com     Y         0.0        0.0   
4     luka Doncic             www.77.com     N   1500000.0      347.0   

        handle  
0           nh  
1          NaN  
2  Kobe Bryant  
3           nh  
4  luka7doncic  
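The final fillna step can equivalently be written with DataFrame.combine_first, which keeps the caller's non-missing values and falls back to the other frame; a minimal sketch with made-up values (not the frames above):

```python
import pandas as pd
import numpy as np

# primary result with a gap, and a fallback aligned on the same index
primary  = pd.DataFrame({'followers': [np.nan, 5.0]}, index=[0, 1])
fallback = pd.DataFrame({'followers': [3.0, 9.0]}, index=[0, 1])

# combine_first keeps primary's values and fills its NaNs from fallback
out = primary.combine_first(fallback)
print(out['followers'].tolist())  # [3.0, 5.0]
```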

葬シ愛 2025-02-01 22:31:05

You can pass lists to left_on and right_on -

final = df_a.merge(
    right=df_b, 
    left_on=['player', 'website', 'website', 'website'], 
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)

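Note, though, that when merge receives list-valued keys it requires every key pair to match simultaneously, i.e. an AND of the conditions rather than the OR the question asks for. A toy example (hypothetical frames, not the question's data) shows the effect:

```python
import pandas as pd

left = pd.DataFrame({'player': ['luka'], 'website': ['www.77.com']})
right = pd.DataFrame({'handle': ['luka'], 'url': ['www.77.com'],
                      'web_addr': ['other.com'], 'notes': ['other.com'],
                      'followers': [347]})

# every key pair must match at once: player==handle AND website==url
# AND website==web_addr AND website==notes
out = left.merge(right,
                 left_on=['player', 'website', 'website', 'website'],
                 right_on=['handle', 'url', 'web_addr', 'notes'],
                 how='left')
# web_addr/notes differ, so no row matches and followers is NaN
```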