How to do fuzzy matching within the same dataset on multiple columns

Posted 2025-01-11 21:07:56

I have a student rank dataset in which a few values are missing. I want to fuzzy match on the Name and Rank columns within the same dataset, find the best-matching values, fill in the null values in the remaining columns, and add a matched name column, a matched rank column, and a score. I'm a beginner, so it would be great if someone could help me. Thank you.

data:

    Name  School Marks Location Rank
0   JACK   TML    90    AU       3
1   JHON   SSP    85    NULL     NULL
2   NULL   TML    NULL  AU       3
3   BECK   NTC    NULL  EU       2
4   JHON   SSP    NULL  JP       1
5   SEON   NTC    80    RS       5



Expected Data Output:
data:

    Name  School Marks Location Rank Matched_Name Matched_Rank Score
0   JACK   TML    90    AU       3     Jack           3         100
1   JHON   SSP    85    JP       1     JHON           1         100
2   BECK   NTC    NULL  EU       2      -             -          -
3   SEON   NTC    80    RS       5      -             -          -

How can I do this with fuzzy logic?

Here is my code:

import datetime

import fuzzymatcher
import pandas as pd

# Load the same file twice so the dataset can be matched against itself
ds1 = pd.read_csv("dataset.csv")
ds2 = pd.read_csv("dataset.csv")

# Columns to match on from ds1 (left)
left_on = ["Name", "Rank"]

# Columns to match on from ds2 (right)
right_on = ["Name", "Rank"]

# Now perform the match, timing how long it takes
a = datetime.datetime.now()
print('started at :', a)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
                                               ds2,
                                               left_on,
                                               right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b - a)

print(matched_results)
print(matched_results.columns)

# to_csv returns None when given a path, so call it without print
matched_results.to_csv('matched_results.csv', index=False)

# Let's see the best matches
print(matched_results.sort_values(by=['best_match_score'], ascending=False).head(5))

Comments (1)

还不是爱你 2025-01-18 21:07:56

fuzzywuzzy is usually used when names are not exact matches, and I can't see that in your example. However, if your names aren't exact matches, you can do the following:

  1. Create a list of all school names using
df['school_name'].tolist()
  2. Find the null values in your data frame.
  3. Use
process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)
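
The matching step above can be sketched roughly as follows. To keep the example self-contained, it uses `difflib` from the Python standard library as a stand-in for fuzzywuzzy, so the scores will not be the same as those from `fuzz.partial_ratio`. The helper `best_match` and the misspelled query names are hypothetical, introduced only for illustration; the list of known names comes from the sample data in the question. Step 2 (finding the rows with null values) is not shown.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (choice, score) for the most similar choice; score is 0-100."""
    best, best_score = None, 0
    for choice in choices:
        # Case-insensitive similarity ratio, scaled to 0-100
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best, best_score = choice, score
    return best, best_score


# Step 1: the list of known values to match against (names from the question)
names_list = ["JACK", "JHON", "BECK", "SEON"]

# Step 3: for each value that needs a match, pick the closest known name
# (these query spellings are assumptions, not taken from the question)
for current_name in ["Jack", "JOHN"]:
    print(current_name, "->", best_match(current_name, names_list))
```

In practice you would probably also set a minimum score and treat anything below it as "no match", since the closest candidate is not always a good one.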

Just remember that you don't need fuzzy matching when the names match exactly. In that case you only need to filter the data frame like this:

filtered = df[df['school_name'] == x]

and use the result to replace values in the original data frame.
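
For the sample data in the question, where the names match exactly, a minimal sketch might look like the following. Rather than filtering one name at a time, it groups rows by exact `Name` and uses pandas `groupby(...).first()`, which takes the first non-null value in each column. This is one possible approach under that assumption, not necessarily what the answer had in mind, and it does not produce the `Matched_Name`, `Matched_Rank`, or `Score` columns. The row with a missing name is simply dropped here.

```python
import pandas as pd

# Sample rows from the question; None marks the NULL cells
df = pd.DataFrame({
    "Name": ["JACK", "JHON", None, "BECK", "JHON", "SEON"],
    "School": ["TML", "SSP", "TML", "NTC", "SSP", "NTC"],
    "Marks": [90, 85, None, None, None, 80],
    "Location": ["AU", None, "AU", "EU", "JP", "RS"],
    "Rank": [3, None, 3, 2, 1, 5],
})

# Drop rows without a name, then take the first non-null value per column
# for each exact Name; sort=False keeps names in order of first appearance
filled = (
    df.dropna(subset=["Name"])
      .groupby("Name", sort=False)
      .first()
      .reset_index()
)
print(filled)
```

The two JHON rows collapse into one row that combines the Marks from the first with the Location and Rank from the second, which is the behaviour the expected output asks for.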
