Fuzzy matching and grouping

Posted on 2025-01-13 05:07:58

I am trying to do fuzzy matching and grouping using Python on multiple fields. I want to compare each column with a different fuzzy threshold. I tried searching on Google but could not find any solution that can do deduplication and then create groups based on different columns.

Input:

Name    Address
Robert  9185 Pumpkin Hill St.
Rob     9185 Pumpkin Hill Street
Mike    1296 Tunnel St.
Mike    Tunnel Street 1296
John    6200 Beechwood Drive

Output:

Group ID  Name    Address
1         Robert  9185 Pumpkin Hill St.
1         Rob     9185 Pumpkin Hill Street
2         Mike    1296 Tunnel St.
2         Mike    Tunnel Street 1296
3         John    6200 Beechwood Drive

Comments (2)

春风十里 2025-01-20 05:07:58

I'd recommend reviewing Levenshtein distance, as this is a common algorithm for identifying similar strings. The FuzzyWuzzy library (goofy name, I know) implements it with several different scoring approaches. See this article for more info.
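
As a quick illustration of how the raw edit distance relates to the 0-100 score that fuzz.ratio returns (a sketch added here for clarity, assuming the python-levenshtein package from the install comment below, which exposes a Levenshtein module):

import Levenshtein
from fuzzywuzzy import fuzz

#Raw edit distance: three single-character deletions turn "Robert" into "Rob"
print(Levenshtein.distance("Robert", "Rob"))   #3
#fuzz.ratio rescales similarity to 0-100; 67 here, matching the output further down
print(fuzz.ratio("Robert", "Rob"))             #67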

Here's a starting place that compares each string against every other string. You mention having different thresholds, so all you would need to do is loop through l_match and group the rows whose scores clear your threshold for each column; a sketch of that grouping step follows the results below.


#Run this to install the required libraries
#pip install python-levenshtein fuzzywuzzy
from fuzzywuzzy import fuzz

l_data = [
     ['Robert', '9185 Pumpkin Hill St.']
    ,['Rob', '9185 Pumpkin Hill Street']
    ,['Mike', '1296 Tunnel St.']
    ,['Mike', 'Tunnel Street 1296']
    ,['John', '6200 Beechwood Drive']
]
l_match = []

#Loop through the data
for idx1, row1 in enumerate(l_data):
    #Compare each person only with the people that come later in the list
    #(so each pair is scored once instead of both A vs B and B vs A)
    for idx2, row2 in enumerate(l_data[idx1+1:]):
        #Calculate the index of row2 in the original list
        origIdx = idx1 + idx2 + 1
        l_match.append([idx1, origIdx, fuzz.ratio(row1[0], row2[0]), fuzz.ratio(row1[1], row2[1])])

#Print the raw data with its index
for idx, val in enumerate(l_data):
    print(f'{idx}-{val}')
print("*" * 100)

#Print the results of the pairwise comparison
for row in l_match:
    id1 = row[0]
    id2 = row[1]
    formattedName1 = f'{id1}-{l_data[id1][0]}'
    formattedName2 = f'{id2}-{l_data[id2][0]}'
    print(f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')

Results:

0-['Robert', '9185 Pumpkin Hill St.']
1-['Rob', '9185 Pumpkin Hill Street']
2-['Mike', '1296 Tunnel St.']
3-['Mike', 'Tunnel Street 1296']
4-['John', '6200 Beechwood Drive']
****************************************************************************************************
0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio
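
One way to turn those pairwise scores into the groups the question asks for (this part is a sketch, not from the original answer) is to merge any two rows whose name score and address score each clear their own threshold, then number the resulting groups. The thresholds below are placeholders to tune per column; with 60 for names and 55 for addresses, the sample data above happens to fall into the question's three groups. The snippet continues from the code above and reuses l_data and l_match:

#Placeholder thresholds - tune these per column for your own data
NAME_THRESHOLD = 60
ADDR_THRESHOLD = 55

#Start with every row in its own group, then merge rows that match on both columns
group_of = list(range(len(l_data)))

def find(i):
    #Follow the chain of merges up to the group's representative row
    while group_of[i] != i:
        i = group_of[i]
    return i

for id1, id2, name_score, addr_score in l_match:
    if name_score >= NAME_THRESHOLD and addr_score >= ADDR_THRESHOLD:
        group_of[find(id2)] = find(id1)

#Renumber the representatives as sequential group IDs and print each row
group_ids = {}
for idx, (name, address) in enumerate(l_data):
    group_id = group_ids.setdefault(find(idx), len(group_ids) + 1)
    print(f'{group_id} {name} {address}')
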
一片旧的回忆 2025-01-20 05:07:58

Stephan explained the code pretty well, so I don't need to explain it again. You can also try fuzz.partial_ratio; it might provide some interesting results.

from thefuzz import fuzz
print(fuzz.ratio("Turkey is the best country", "Turkey is the best country!"))
#98
print(fuzz.partial_ratio("Turkey is the best country", "Turkey is the best country!"))
#100
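
For the address column in the question, where the same street shows up with its tokens in a different order ("1296 Tunnel St." vs "Tunnel Street 1296"), a quick way to see the difference is to score that pair with both functions (a sketch reusing the question's data rather than the original answer's example):

from thefuzz import fuzz

#Plain ratio scored this pair at 55 in the first answer's output
print(fuzz.ratio("1296 Tunnel St.", "Tunnel Street 1296"))
#partial_ratio scores the best-matching substring instead of the whole strings
print(fuzz.partial_ratio("1296 Tunnel St.", "Tunnel Street 1296"))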