How to do fuzzy matching within the same dataset with multiple columns

Posted on 2025-01-11 21:07:56

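I have a student rank dataset in which some values are missing. I want to fuzzy-match the Name and Rank columns within the same dataset, find the best-matching rows, fill the null values in the remaining columns from those matches, and add a matched-name column, a matched-rank column, and a score. I'm a beginner, so it would be great if someone could help me. Thank you.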

data:

    Name  School Marks Location Rank
0   JACK   TML    90    AU       3
1   JHON   SSP    85    NULL     NULL
2   NULL   TML    NULL  AU       3
3   BECK   NTC    NULL  EU       2
4   JHON   SSP    NULL  JP       1
5   SEON   NTC    80    RS       5



Expected Data Output:
data:

    Name  School Marks Location Rank Matched_Name Matched_Rank Score
0   JACK   TML    90    AU       3     Jack           3         100
1   JHON   SSP    85    JP       1     JHON           1         100
2   BECK   NTC    NULL  EU       2      -             -          -
3   SEON   NTC    80    RS       5      -             -          -

How can I do this with fuzzy logic?

Here is my code:

import datetime

import fuzzymatcher
import pandas as pd

# Load the same file twice so the dataset can be matched against itself
ds1 = pd.read_csv("dataset.csv")
ds2 = pd.read_csv("dataset.csv")

# Columns to match on from the left frame (ds1)
left_on = ["Name", "Rank"]

# Columns to match on from the right frame (ds2)
right_on = ["Name", "Rank"]

# Perform the match and time it
start = datetime.datetime.now()
print("started at:", start)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1, ds2, left_on, right_on)
end = datetime.datetime.now()
print("end at:", end)
print("Time taken:", end - start)

print(matched_results)
print(matched_results.columns)
# to_csv returns None when given a path, so there is nothing to print
matched_results.to_csv("matched_results.csv", index=False)

# Let's see the best matches
print(matched_results.sort_values(by=["best_match_score"], ascending=False).head(5))



还不是爱你 2025-01-18 21:07:56

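fuzzywuzzy is usually needed only when names are not exact matches, and I can't see that in your example. However, if your names aren't exact matches, you can do the following:

Create a list of all school names using

df['school_name'].tolist()

Find the null values in your data frame, then for each one use

process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)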
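
The lookup step can be sketched roughly as follows. This is not the answerer's code: it swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy so that it runs without extra packages, and the helper `best_match` and the misspelled query "JOHN" are hypothetical. With fuzzywuzzy installed, `process.extractOne` plays the same role, though its scores are on a different scale.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (choice, score) for the choice most similar to query.

    score is a 0-100 integer derived from SequenceMatcher.ratio();
    it is not comparable to fuzzywuzzy's scorers.
    """
    best_choice, best_score = None, -1
    for choice in choices:
        ratio = SequenceMatcher(None, query.upper(), choice.upper()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score


# Candidate names taken from the question's sample data
names = ["JACK", "JHON", "BECK", "SEON"]

# Hypothetical misspelled query; the closest candidate is returned with its score
match, score = best_match("JOHN", names)
print(match, score)
```

In practice it is worth choosing a cutoff score and ignoring matches below it, since the best candidate is returned even when nothing is really similar.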

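Just remember that you should never use fuzzy matching when you have exact names. In that case you only need to filter the data frame like this:

filtered = df[df['school_name'] == x]

and use the result to replace values in the original data frame.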
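
For the sample data in the question, where the names match exactly, one possible way to do the fill is shown below. It is a minimal sketch under two assumptions: rows with the same `Name` describe the same student, and rows whose `Name` is null can simply be dropped. The variable name `filled` is illustrative.

```python
import pandas as pd

# Sample data from the question; None marks the NULL cells
df = pd.DataFrame({
    "Name": ["JACK", "JHON", None, "BECK", "JHON", "SEON"],
    "School": ["TML", "SSP", "TML", "NTC", "SSP", "NTC"],
    "Marks": [90, 85, None, None, None, 80],
    "Location": ["AU", None, "AU", "EU", "JP", "RS"],
    "Rank": [3, None, 3, 2, 1, 5],
})

# Rows that share an exact Name are collapsed into one row.
# GroupBy.first() takes the first non-null value in each column,
# so gaps in one row are filled from the other rows with the same name.
# Rows whose Name is null are dropped by groupby's default dropna=True.
filled = df.groupby("Name", sort=False, as_index=False).first()
print(filled)
```

This handles only exact duplicates. The row with a null `Name` would need a different key (for example `School` and `Rank`) or a fuzzy lookup to be attached to a student.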

