不同模糊比率的模糊匹配

发布于 2025-01-11 07:54:14 字数 2009 浏览 0 评论 0原文

我有两个大数据集。 df1约为1m行,df2约为10m行。我需要从 df2 中查找 df1 中的行的匹配项。
我已经单独发布了这个问题的原始版本。请参阅此处 。 @laurent 回答得很好,但我现在有一些额外的细节。我现在想要:

  1. 获取最终匹配数据帧的列中每个 fname 和 lname 的模糊比率

  2. 编写代码,将 fname 的模糊比率设置为 > 60,而 lname 的模糊比率设置为 > 75。换句话说,如果fname的fuzz_ratio>60并且lname的fuzz_ratio>75,则发生真正的匹配;否则不是真正的匹配。如果 fname 的模糊比率 == 80 而 lname 的模糊比率 == 60,则匹配不正确。虽然我知道这可以从 (1) 中作为事后过滤来完成,但在不同匹配的编码阶段执行此操作是有意义的。

我在这里发布了我的数据示例。 @laurent 针对原始问题的解决方案可以在上面的链接中找到。

import pandas as pd

df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)

df2 = pd.DataFrame(
    {
        "lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
        "fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
        "score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
    }
)

预期输出为:

df3 = pd.DataFrame(
    {
        "df1_ein": {0: 1001, 1: 1500, 2: 3000},
        "df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
        "fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
        "fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
        "df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
        "df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
        "df2_score": {0: 2.5, 1: 3.5, 2: NA},

    }
)

上述预期输出的注释:根据我指定的模糊比率,Bradley Cupper 与 Bradley Cooper 不匹配。与布拉德利·库珀 (Bradley Cooper) 更匹配的是 M·布拉德·库珀 (M Brad Couper)。同样,托马斯·克鲁斯与汤姆·克鲁斯匹配,而不是与托马斯·克鲁兹匹配。

我主要是Stata的用户(哈哈),reclink2 ado文件理论上可以做到上述,即如果Stata可以处理数据的大小。然而,考虑到我拥有的数据量,几小时后甚至没有任何启动。

I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. Well answered by @laurent but I have some added specificities now. I would now like to:

  1. Get the fuzz ratios for each of fname and lname in a column in my final matched dataframe

  2. Write the code such that fuzz ratio for fname is set to >60, while fuzz ratio for lname is set to >75. In other words, a true match occurs if fuzz_ratio for fname>60 and fuzz ratio for lname>75; otherwise not a true match. A match would not be true if fuzz ratio for fname==80 while fuzz ratio for lname==60. While I understand that this can be done from (1) as a post-hoc filtering, it would make sense to do this at the stage of coding for a different matching.

I post here an example of my data. The solution by @laurent for the original problem can be found in the above link.

import pandas as pd

df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)

df2 = pd.DataFrame(
    {
        "lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
        "fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
        "score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
    }
)

Expected output is:

df3 = pd.DataFrame(
    {
        "df1_ein": {0: 1001, 1: 1500, 2: 3000},
        "df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
        "fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
        "fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
        "df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
        "df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
        "df2_score": {0: 2.5, 1: 3.5, 2: NA},

    }
)

Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios that I assigned. The better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches with Tom Cruise rather than with Thomas Cruz.

I am a user of Stata primarily (haha) and the reclink2 ado file can do the above in theory, i.e. if Stata can handle the size of the data. However, with the size of data I have, nothing even starts after hours.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

贩梦商人 2025-01-18 07:54:14

这是一种方法:

import pandas as pd
from fuzzywuzzy import fuzz

# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]

# Add new columns
df1["fuzz_ratio_lname"] = (
    df1["df1_lname"]
    .apply(
        lambda x: max(
            [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
            key=lambda x: x[1],
        )
    )
    .apply(lambda x: x if x[1] > 75 else pd.NA)
)

df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
    df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)

df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA

# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
    [
        "df1_ein",
        "df1_ein_name",
        "df1_lname",
        "df1_fname",
        "fuzz_ratio_lname",
        "fuzz_ratio_fname",
        "df2_lname",
        "df2_fname",
        "score",
    ]
]

print(df1)
# Output
   df1_ein         df1_ein_name df1_lname df1_fname  fuzz_ratio_lname  \
0     1001       H for Humanity    Cooper   Bradley              83.0   
1     1500          Labor Union    Cruise    Thomas             100.0   
2     3000  Something something      Pitt      Brad               NaN   

  fuzz_ratio_fname df2_lname df2_fname  score  
0             62.0    Couper    M Brad    2.5  
1             67.0    Cruise       Tom    3.5  
2             <NA>      <NA>      <NA>    NaN  

Here is one way to do it:

import pandas as pd
from fuzzywuzzy import fuzz

# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]

# Add new columns
df1["fuzz_ratio_lname"] = (
    df1["df1_lname"]
    .apply(
        lambda x: max(
            [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
            key=lambda x: x[1],
        )
    )
    .apply(lambda x: x if x[1] > 75 else pd.NA)
)

df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
    df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)

df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA

# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
    [
        "df1_ein",
        "df1_ein_name",
        "df1_lname",
        "df1_fname",
        "fuzz_ratio_lname",
        "fuzz_ratio_fname",
        "df2_lname",
        "df2_fname",
        "score",
    ]
]

print(df1)
# Output
   df1_ein         df1_ein_name df1_lname df1_fname  fuzz_ratio_lname  \
0     1001       H for Humanity    Cooper   Bradley              83.0   
1     1500          Labor Union    Cruise    Thomas             100.0   
2     3000  Something something      Pitt      Brad               NaN   

  fuzz_ratio_fname df2_lname df2_fname  score  
0             62.0    Couper    M Brad    2.5  
1             67.0    Cruise       Tom    3.5  
2             <NA>      <NA>      <NA>    NaN  
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文