Fuzzy matching and grouping
I am trying to do fuzzy matching and grouping on multiple fields using Python. I want to compare each column using a different fuzzy threshold. I searched on Google but could not find any solution that performs deduplication and then creates groups across different columns.
Input:
Name | Address |
---|---|
Robert | 9185 Pumpkin Hill St. |
Rob | 9185 Pumpkin Hill Street |
Mike | 1296 Tunnel St. |
Mike | Tunnel Street 1296 |
John | 6200 Beechwood Drive |
Output:
Group ID | Name | Address |
---|---|---|
1 | Robert | 9185 Pumpkin Hill St. |
1 | Rob | 9185 Pumpkin Hill Street |
2 | Mike | 1296 Tunnel St. |
2 | Mike | Tunnel Street 1296 |
3 | John | 6200 Beechwood Drive |
Comments (2)
I'd recommend reviewing Levenshtein distance, as it is a common algorithm for identifying similar strings. The FuzzyWuzzy library (goofy name, I know) implements it with 3 different approaches. See this article for more info.
Here's a starting point that compares each string against every other string. You mention having different thresholds, so all you would need to do is loop through `l_match` and group the matches depending on your desired thresholds.
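The pairwise comparison and per-column thresholds described above can be sketched roughly as follows. This is only an illustration, not the answerer's original code: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer (FuzzyWuzzy's `fuzz.ratio` or `fuzz.token_sort_ratio`, scaled to 0-100, could be swapped in), and the names `normalize`, `similarity`, `group_records`, and the threshold values are assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Stand-in fuzzy score between 0.0 and 1.0 (not Levenshtein itself)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_records(records, thresholds):
    """Assign a group ID to each record.

    Two records are linked when every field meets its own threshold;
    linked records are merged into one group with a small union-find.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare each record against every other record.
    for i, j in combinations(range(len(records)), 2):
        if all(
            similarity(records[i][field], records[j][field]) >= minimum
            for field, minimum in thresholds.items()
        ):
            parent[find(j)] = find(i)

    # Number the groups in order of first appearance.
    group_ids = {}
    return [
        group_ids.setdefault(find(i), len(group_ids) + 1)
        for i in range(len(records))
    ]


records = [
    {"name": "Robert", "address": "9185 Pumpkin Hill St."},
    {"name": "Rob", "address": "9185 Pumpkin Hill Street"},
    {"name": "Mike", "address": "1296 Tunnel St."},
    {"name": "Mike", "address": "Tunnel Street 1296"},
    {"name": "John", "address": "6200 Beechwood Drive"},
]

# Assumed thresholds: names may differ more than addresses.
thresholds = {"name": 0.6, "address": 0.8}

print(group_records(records, thresholds))  # [1, 1, 2, 2, 3]
```

Sorting the tokens before scoring is what lets "1296 Tunnel St." and "Tunnel Street 1296" match despite the different word order. The thresholds would need tuning on real data, and comparing every pair is quadratic, so larger datasets would likely need some form of blocking first.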
Stephan explained the code pretty well, so I won't explain it again. You can also try `fuzz.partial_ratio`; it might give some interesting results.