Fuzzy-matching a column against a list of correct names

Posted on 2025-02-09 23:00:58

I have a DataFrame column with typos.

ID	Bankname
1	Bank of America
2	bnk of America
3	Jp Morg
4	Jp Morgan

I also have a list with the correct names of the banks.

["Bank of America", "JPMorgan Chase"]

I want to check the column and replace the wrong bank names with the correct names from the list, using the Levenshtein distance.


榕城若虚 2025-02-16 23:00:58

Here is one simple way to do it with the difflib module from the Python standard library, which provides helpers for comparing sequences and computing deltas.

from difflib import SequenceMatcher

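# Define a helper function: return the best match for x among values,
# or x itself when no candidate scores above threshold
def match(x, values, threshold):
    def ratio(a, b):
        # Similarity score in [0, 1]; 1.0 means the strings are identical
        return SequenceMatcher(None, a, b).ratio()

    # Keep only the candidates whose similarity exceeds the threshold
    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    # Pick the highest-scoring candidate, or leave x unchanged if none qualified
    return max(results, key=results.get) if results else x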

And then:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)

names = ["Bank of America", "JPMorgan Chase"]

df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))

So that:

print(df)
# Output
   ID         Bankname
0   1  Bank of America
1   2  Bank of America
2   3   JPMorgan Chase
3   4   JPMorgan Chase
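
As a shorter variant, difflib also ships get_close_matches, which applies the same SequenceMatcher ratio and returns the best candidates at or above a cutoff. The sketch below is only an illustration: the helper name closest and the plain-list input are assumptions made here, not part of the answer above.

```python
from difflib import get_close_matches

def closest(x, values, cutoff=0.4):
    # Hypothetical helper: get_close_matches returns up to n candidates
    # scoring at least `cutoff`, ordered from best to worst
    found = get_close_matches(x, values, n=1, cutoff=cutoff)
    # Fall back to the original string when nothing is close enough
    return found[0] if found else x

names = ["Bank of America", "JPMorgan Chase"]
raw = ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"]

cleaned = [closest(x, names) for x in raw]
print(cleaned)
```

With a DataFrame, the same helper could be passed to df["Bankname"].apply in place of match. Note that the cutoff here is inclusive (a score equal to it qualifies), whereas match uses a strict comparison.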

Of course, you can replace the inner ratio function with any other, more appropriate similarity measure. Note that SequenceMatcher.ratio() is based on matching blocks, not on the Levenshtein distance.
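
Since the question asks for the Levenshtein distance specifically, here is a minimal pure-Python sketch of a Levenshtein-based score that could stand in for the ratio function. The names levenshtein, levenshtein_ratio and match_levenshtein are hypothetical, and lowercasing before comparison and normalising by the longer string's length are assumptions made for this example. Scores on this scale differ from SequenceMatcher's, so the threshold is lowered to 0.3 here; it would need tuning on real data.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]

def levenshtein_ratio(a, b):
    # Assumption: compare case-insensitively and scale the distance to [0, 1]
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest

def match_levenshtein(x, values, threshold):
    # Same contract as match: best candidate above threshold, else x unchanged
    if not values:
        return x
    best = max(values, key=lambda value: levenshtein_ratio(value, x))
    return best if levenshtein_ratio(best, x) > threshold else x

names = ["Bank of America", "JPMorgan Chase"]
raw = ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"]

cleaned_lev = [match_levenshtein(x, names, 0.3) for x in raw]
print(cleaned_lev)
```

Short fragments such as "Jp Morg" sit far from "JPMorgan Chase" in edit distance because of the missing " Chase" suffix, which is why a threshold that works for SequenceMatcher can be too strict for this measure.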


About the author

走过海棠暮
