Fuzzy matching and grouping

Posted on 2025-01-13 05:07:58

I am trying to do fuzzy matching and grouping using Python on multiple fields. I want to compare each column with a different fuzzy threshold. I tried searching on Google but could not find any solution that can do deduplication and then create groups based on different columns.

Input:

Name    Address
Robert  9185 Pumpkin Hill St.
Rob     9185 Pumpkin Hill Street
Mike    1296 Tunnel St.
Mike    Tunnel Street 1296
John    6200 Beechwood Drive

Output:

Group ID  Name    Address
1         Robert  9185 Pumpkin Hill St.
1         Rob     9185 Pumpkin Hill Street
2         Mike    1296 Tunnel St.
2         Mike    Tunnel Street 1296
3         John    6200 Beechwood Drive

Comments (2)

春风十里 2025-01-20 05:07:58

I'd recommend reviewing Levenshtein distance, as this is a common algorithm for identifying similar strings. The FuzzyWuzzy library (goofy name, I know) implements it with several different scoring approaches. See this article for more info.
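
As a quick illustration of how the raw edit distance relates to the 0-100 score that fuzz.ratio returns (a sketch added here for clarity, assuming the python-levenshtein package from the install comment below, which exposes a Levenshtein module):

import Levenshtein
from fuzzywuzzy import fuzz

#Raw edit distance: three single-character deletions turn "Robert" into "Rob"
print(Levenshtein.distance("Robert", "Rob"))   #3
#fuzz.ratio rescales similarity to 0-100; 67 here, matching the output further down
print(fuzz.ratio("Robert", "Rob"))             #67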

Here's a starting place that compares each string against every other string. You mention having different thresholds, so all you would need to do is loop through l_match and group the rows whose scores clear your threshold for each column; a sketch of that grouping step follows the results below.


#Run this to install the required libraries
#pip install python-levenshtein fuzzywuzzy
from fuzzywuzzy import fuzz

l_data = [
     ['Robert', '9185 Pumpkin Hill St.']
    ,['Rob', '9185 Pumpkin Hill Street']
    ,['Mike', '1296 Tunnel St.']
    ,['Mike', 'Tunnel Street 1296']
    ,['John', '6200 Beechwood Drive']
]
l_match = []

#Loop through the data
for idx1, row1 in enumerate(l_data):
    #Compare each person only with the people that come later in the list
    #(so each pair is scored once instead of both A vs B and B vs A)
    for idx2, row2 in enumerate(l_data[idx1+1:]):
        #Calculate the index of row2 in the original list
        origIdx = idx1 + idx2 + 1
        l_match.append([idx1, origIdx, fuzz.ratio(row1[0], row2[0]), fuzz.ratio(row1[1], row2[1])])

#Print the raw data with its index
for idx, val in enumerate(l_data):
    print(f'{idx}-{val}')
print("*" * 100)

#Print the results of the pairwise comparison
for row in l_match:
    id1 = row[0]
    id2 = row[1]
    formattedName1 = f'{id1}-{l_data[id1][0]}'
    formattedName2 = f'{id2}-{l_data[id2][0]}'
    print(f'{formattedName1} and {formattedName2} have {row[2]}% name similarity ratio and {row[3]}% address similarity ratio')

Results:

0-['Robert', '9185 Pumpkin Hill St.']
1-['Rob', '9185 Pumpkin Hill Street']
2-['Mike', '1296 Tunnel St.']
3-['Mike', 'Tunnel Street 1296']
4-['John', '6200 Beechwood Drive']
****************************************************************************************************
0-Robert and 1-Rob have 67% name similarity ratio and 89% address similarity ratio
0-Robert and 2-Mike have 20% name similarity ratio and 50% address similarity ratio
0-Robert and 3-Mike have 20% name similarity ratio and 31% address similarity ratio
0-Robert and 4-John have 20% name similarity ratio and 15% address similarity ratio
1-Rob and 2-Mike have 0% name similarity ratio and 41% address similarity ratio
1-Rob and 3-Mike have 0% name similarity ratio and 48% address similarity ratio
1-Rob and 4-John have 29% name similarity ratio and 18% address similarity ratio
2-Mike and 3-Mike have 100% name similarity ratio and 55% address similarity ratio
2-Mike and 4-John have 0% name similarity ratio and 23% address similarity ratio
3-Mike and 4-John have 0% name similarity ratio and 21% address similarity ratio
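
One way to turn those pairwise scores into the groups the question asks for (this part is a sketch, not from the original answer) is to merge any two rows whose name score and address score each clear their own threshold, then number the resulting groups. The thresholds below are placeholders to tune per column; with 60 for names and 55 for addresses, the sample data above happens to fall into the question's three groups. The snippet continues from the code above and reuses l_data and l_match:

#Placeholder thresholds - tune these per column for your own data
NAME_THRESHOLD = 60
ADDR_THRESHOLD = 55

#Start with every row in its own group, then merge rows that match on both columns
group_of = list(range(len(l_data)))

def find(i):
    #Follow the chain of merges up to the group's representative row
    while group_of[i] != i:
        i = group_of[i]
    return i

for id1, id2, name_score, addr_score in l_match:
    if name_score >= NAME_THRESHOLD and addr_score >= ADDR_THRESHOLD:
        group_of[find(id2)] = find(id1)

#Renumber the representatives as sequential group IDs and print each row
group_ids = {}
for idx, (name, address) in enumerate(l_data):
    group_id = group_ids.setdefault(find(idx), len(group_ids) + 1)
    print(f'{group_id} {name} {address}')
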
一片旧的回忆 2025-01-20 05:07:58

Stephan explained the code pretty well, so I don't need to explain it again. You can also try fuzz.partial_ratio; it might provide some interesting results.

from thefuzz import fuzz
print(fuzz.ratio("Turkey is the best country", "Turkey is the best country!"))
#98
print(fuzz.partial_ratio("Turkey is the best country", "Turkey is the best country!"))
#100
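
For the address column in the question, where the same street shows up with its tokens in a different order ("1296 Tunnel St." vs "Tunnel Street 1296"), a quick way to see the difference is to score that pair with both functions (a sketch reusing the question's data rather than the original answer's example):

from thefuzz import fuzz

#Plain ratio scored this pair at 55 in the first answer's output
print(fuzz.ratio("1296 Tunnel St.", "Tunnel Street 1296"))
#partial_ratio scores the best-matching substring instead of the whole strings
print(fuzz.partial_ratio("1296 Tunnel St.", "Tunnel Street 1296"))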