在2个PANDAS数据框架中执行模糊匹配

发布于 2025-01-30 16:23:02 字数 1100 浏览 1 评论 0原文

我有两个数据范围,不同的行编号包含有关玩家的信息。第一个具有我需要的所有名称。

df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})

另一个数据框架缺少某些玩家,但具有年龄的专栏。在某些情况下,玩家的名字差异较小。

df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                   'Age': [22, 21, 26, 18, 19, 25]})

等于名字是不同的人,因此我需要在同一时间和团队中匹配。我想创建一个新的DataFrame,其中包括第一个数据帧的所有名称,并从第二个数据框架上各自的年龄。如果第二次缺少播放器,则具有恒定值的完整新数据帧(例如XX年,可以是任何年龄。最终数据帧:

print(final_df)
           Player Team  Age
0       John Sepi    A   22
1        Zan Fred    C   XX
2     Mark Daniel    E   21
3        Adam Pop    C   XX
4       Paul Sepi    B   XX
5  John Hernandez    D   26
6    Price Josiah    B   18
7  John Hernandez    A   19
8        Adam Pop    D   25

I have two dataframes with different rows numbers contain information about players. The first has all names that I need.

df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})

The another dataframe is missing some players, but has a column with age. The player's names have smaller differences in some cases.

df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                   'Age': [22, 21, 26, 18, 19, 25]})

The equals names are different persons, because of that i need match at the same time Player and Team. I want to create a new dataframe with all names from first dataframe with respective age from second dataframe. In case of missing players in second, complete new dataframe with constant value(like XX years, can be any age..just to illustrate). The final dataframe:

print(final_df)
           Player Team  Age
0       John Sepi    A   22
1        Zan Fred    C   XX
2     Mark Daniel    E   21
3        Adam Pop    C   XX
4       Paul Sepi    B   XX
5  John Hernandez    D   26
6    Price Josiah    B   18
7  John Hernandez    A   19
8        Adam Pop    D   25

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

薄情伤 2025-02-06 16:23:02

您可以使用fuzzywuzzy库的文本匹配功能与Python中的Pandas功能混合。

首先,导入以下库:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

您可以使用fuzzywuzzy python库的文本匹配功能:

#get list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
#define arbitrary threshold
thres = 70
#for each team match similar texts
for team in lst_teams:
    #iterration on dataframe filtered by team
    for index, row in df1.loc[df1['Team']==team].iterrows():
        #get list of players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
        #use of fuzzywuzzy to make text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        #check if there is players from df2 in this team
        if output_ratio !=[]:
            #put arbitrary threshold to get most similar text
            if output_ratio[0][1]>thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team)&(df2['Player']==output_ratio[0][0])]['Age'].values[0]
df1 = df1.fillna('XX')

使用此代码和定义为70的阈值,您将获得以下结果:

print(df1)
           Player Team Age
0       John Sepi    A  22
1        Zan Fred    C  XX
2     Mark Daniel    E  21
3        Adam Pop    C  XX
4       Paul Sepi    B  XX
5  John Hernandez    D  26
6    Price Josiah    B  18
7  John Hernandez    A  19
8        Adam Pop    D  25

可以移动阈值为了提高两个数据范围之间的文本匹配功能的准确性。

请注意,不建议使用.iterrows()作为数据框架上的迭代时要小心。

您可以检查fuzzywuzzy doc: /

You can use the text matching capabilities of the fuzzywuzzy library mixed with pandas functions in python.

First, import the following libraries :

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

You can use the text matching capabilities of the fuzzywuzzy python library :

#get list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
#define arbitrary threshold
thres = 70
#for each team match similar texts
for team in lst_teams:
    #iterration on dataframe filtered by team
    for index, row in df1.loc[df1['Team']==team].iterrows():
        #get list of players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
        #use of fuzzywuzzy to make text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        #check if there is players from df2 in this team
        if output_ratio !=[]:
            #put arbitrary threshold to get most similar text
            if output_ratio[0][1]>thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team)&(df2['Player']==output_ratio[0][0])]['Age'].values[0]
df1 = df1.fillna('XX')

with this code and a threshold defined as 70, you get the following result:

print(df1)
           Player Team Age
0       John Sepi    A  22
1        Zan Fred    C  XX
2     Mark Daniel    E  21
3        Adam Pop    C  XX
4       Paul Sepi    B  XX
5  John Hernandez    D  26
6    Price Josiah    B  18
7  John Hernandez    A  19
8        Adam Pop    D  25

It is possible to move the threshold to increase the accuracy of the text matching capabilities between the two dataframes.

Please note that you should be careful when using .iterrows() as iteration on a dataframe is not advised.

You can check the fuzzywuzzy doc here https://pypi.org/project/fuzzywuzzy/

看春风乍起 2025-02-06 16:23:02

这是一种方法:

df1 = df1.merge(df2,how='left', on=['Players','Team']).fillna(20)

here is one way:

df1 = df1.merge(df2,how='left', on=['Players','Team']).fillna(20)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文