Good Python modules for fuzzy string comparison?

Posted 2024-07-15 23:11:07

I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective so I was hoping to find a library that can do positional comparisons as well as longest similar string matches, among other things.

Basically, I'm hoping to find something that is simple enough to yield a single percentage while still configurable enough that I can specify what type of comparison(s) to do.

Comments (12)

不乱于心 2024-07-22 23:11:07

difflib can do it.

Example from the docs:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Check it out. It has other functions that can help you build something custom.
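
To make the "configurable" part concrete, here is a small sketch of the two tuning knobs that get_close_matches() accepts: n (maximum number of results) and cutoff (minimum similarity, between 0 and 1). The candidate list is the one from the docs example above; the 0.78 cutoff is just an illustrative value.

```python
from difflib import get_close_matches

candidates = ['ape', 'apple', 'peach', 'puppy']

# Defaults: at most n=3 results, each scoring at least cutoff=0.6.
default_matches = get_close_matches('appel', candidates)

# A stricter (illustrative) cutoff drops the weaker 'ape' match.
strict_matches = get_close_matches('appel', candidates, n=3, cutoff=0.78)

print(default_matches)
print(strict_matches)
```

Raising cutoff trades recall for precision, so it is worth trying a few values against your own data.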

So要识趣 2024-07-22 23:11:07

Levenshtein Python extension and C library.

https://github.com/ztane/python-Levenshtein/

The Levenshtein Python C extension module contains functions for fast
computation of
- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity
It supports both normal and Unicode strings.

$ pip install python-levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), because it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0
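
If installing a C extension is not an option, the same distance can be sketched in pure Python (much slower, but fine for short strings). The percentage helper below is a hypothetical illustration: it normalises by the longer string's length, which is only one possible convention and is not claimed to match how Levenshtein.ratio() is computed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: 100 * (1 - distance / length of longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein('Levenshtein', 'Levenshten'))
print(similarity_percent('Levenshtein', 'Levenshten'))
```

Only two rows of the table are kept at a time, so memory stays proportional to the shorter string.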
稀香 2024-07-22 23:11:07

As nosklo said, use the difflib module from the Python standard library.

The difflib module can return a measure of the sequences' similarity using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.

>>> import difflib

>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0

>>> difflib.SequenceMatcher(None, 'abcde', 'zbcde').ratio()
0.8

>>> difflib.SequenceMatcher(None, 'abcde', 'zyzzy').ratio()
0.0
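
Since the question asks for a single percentage with a choice of comparison type, here is one way that could be sketched with the standard library alone. The three function names are made up for illustration; only SequenceMatcher, ratio() and find_longest_match() come from difflib.

```python
from difflib import SequenceMatcher


def ratio_percent(a: str, b: str) -> float:
    """SequenceMatcher.ratio() scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def positional_percent(a: str, b: str) -> float:
    """Hypothetical helper: share of positions holding the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / longest


def longest_common_block(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


print(ratio_percent('abcde', 'zbcde'))
print(positional_percent('abcde', 'zbcde'))
print(longest_common_block('abcde', 'zbcde'))
```

The positional score punishes a single inserted character heavily (everything after it shifts), while ratio() does not, so which one fits depends on the data.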
烈酒灼喉 2024-07-22 23:11:07

Jellyfish is a Python module which supports many string comparison metrics, including phonetic matching. Pure Python implementations of Levenshtein edit distance are quite slow compared to Jellyfish's implementation.

Example Usage:

>>> import jellyfish

>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
>>> # newer jellyfish releases call this function jaro_similarity
>>> jellyfish.jaro_distance('jellyfish', 'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance('jellyfish', 'jellyfihs')
1
>>> jellyfish.metaphone('Jellyfish')
'JLFX'
>>> jellyfish.soundex('Jellyfish')
'J412'
>>> jellyfish.nysiis('Jellyfish')
'JALYF'
>>> jellyfish.match_rating_codex('Jellyfish')
'JLLFSH'
眼角的笑意。 2024-07-22 23:11:07

I like nosklo's answer; another method is the Damerau-Levenshtein distance:

"In information theory and computer science, Damerau–Levenshtein distance is a 'distance' (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters."

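An implementation in Python from Wikibooks (note that this one computes the plain Levenshtein distance, without transpositions, and that its naive recursion takes exponential time, so it is only practical for short strings):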

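def lev(a, b):
    # An empty string is len(other) insertions away from the other string.
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        lev(a[1:], b[1:]) + (a[0] != b[0]),  # substitute, or match at no cost
        lev(a[1:], b) + 1,                   # delete a[0]
        lev(a, b[1:]) + 1,                   # insert b[0]
    )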
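
The recursive lev above handles insertion, deletion and substitution only. A table-based sketch that also counts a swap of two adjacent characters as one operation could look like the following. It implements the restricted variant usually called optimal string alignment, which can differ from the unrestricted Damerau-Levenshtein distance on some inputs; the function name is my own.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count it as one operation.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance('jellyfish', 'jellyfihs'))
```

This runs in time proportional to len(a) * len(b) instead of exponentially.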

More from Wikibooks,
this gives you the length of the longest common substring (LCS):

def LCSubstr_len(S, T):
    m = len(S)
    n = len(T)
    # L[i+1][j+1] = length of the common suffix of S[:i+1] and T[:j+1]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    lcs = 0
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                L[i+1][j+1] = L[i][j] + 1
                lcs = max(lcs, L[i+1][j+1])
    return lcs
土豪 2024-07-22 23:11:07

There is also Google's own google-diff-match-patch ("Currently available in Java, JavaScript, C++ and Python").

(Can't comment on it, since I have only used python's difflib myself)

不必在意 2024-07-22 23:11:07

Another alternative would be to use the recently released package FuzzyWuzzy. The various functions supported by the package are also described in this blogpost.

吻泪 2024-07-22 23:11:07

I am using double-metaphone which works like a charm.

An example:

>>> dm(u'aubrey')
('APR', '')
>>> dm(u'richard')
('RXRT', 'RKRT')
>>> dm(u'katherine') == dm(u'catherine')
True

Update:
Jellyfish also has it. Comes under Phonetic encoding.

余生再见 2024-07-22 23:11:07

I've been using Fuzzy Wuzzy from Seat Geek with great success.

https://github.com/seatgeek/fuzzywuzzy

Specifically the token set ratio function...

They also did a great write-up on the process of fuzzy string matching:

http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
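
To give a feel for why token-based ratios help, here is a rough standard-library sketch of the simpler token sort idea: normalise case, sort the words, then compare. This is not FuzzyWuzzy's code and will not reproduce its scores; the function names are made up for illustration.

```python
from difflib import SequenceMatcher


def plain_percent(a: str, b: str) -> float:
    """Character-level similarity, scaled to 0-100."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_sort_percent(a: str, b: str) -> float:
    """Hypothetical helper: compare after lowercasing and sorting the words."""
    def normalise(text: str) -> str:
        return ' '.join(sorted(text.lower().split()))
    return plain_percent(normalise(a), normalise(b))


print(plain_percent('new york mets', 'mets new york'))
print(token_sort_percent('new york mets', 'mets new york'))
```

Sorting the tokens makes the score insensitive to word order, which is typically what you want for names and titles.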

如梦初醒的夏天 2024-07-22 23:11:07

Here's how it can be done using Charikar's simhash. It is also suitable for long documents, and it still reports 100% similarity when the order of words in the documents changes.

http://blog.simpliplant.eu/calculating-similarity-between-text-strings-in-python/
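
In case the link goes stale, the idea can be sketched roughly as follows. This is my own minimal version, not the code from the linked post, and it makes several assumptions: features are whitespace-separated lowercase words with equal weight, each word is hashed with SHA-256 truncated to 64 bits, and similarity is reported as the share of matching fingerprint bits.

```python
import hashlib

BITS = 64  # fingerprint width (an arbitrary choice for this sketch)


def simhash(text: str) -> int:
    """Build a 64-bit fingerprint by letting every word vote on every bit."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        token_hash = int.from_bytes(digest[:8], 'big')
        for bit in range(BITS):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    # A fingerprint bit is set when more words voted 1 than 0.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def simhash_similarity(a: str, b: str) -> float:
    """Share of fingerprint bits that agree, from 0.0 to 1.0."""
    differing = bin(simhash(a) ^ simhash(b)).count('1')
    return 1.0 - differing / BITS


print(simhash_similarity('the quick brown fox', 'fox brown quick the'))
print(simhash_similarity('the quick brown fox', 'the quick brown dog'))
```

Because each word votes independently of its position, reordering the words leaves the fingerprint unchanged.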

独享拥抱 2024-07-22 23:11:07

Here's a Python script for computing the longest common substring of two words (may need tweaking to work for multi-word phrases):

def lcs(word1, word2):
    # Collect every non-empty substring of each word.
    w1 = set(word1[i:j] for i in range(len(word1))
             for j in range(i + 1, len(word1) + 1))

    w2 = set(word2[i:j] for i in range(len(word2))
             for j in range(i + 1, len(word2) + 1))

    common_subs = w1.intersection(w2)

    # Longest common substring; ties go to the lexicographically largest one,
    # and '' is returned when the words share nothing.
    return max(common_subs, key=lambda sub: (len(sub), sub), default='')
自控 2024-07-22 23:11:07

Take a look at the Fuzzy module. It has fast, C-based implementations of the Soundex, NYSIIS and double-metaphone algorithms.

A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528
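
For readers who just want to see what a phonetic code looks like before installing anything, here is a pure Python sketch of classic American Soundex as it is commonly described. It is not the Fuzzy module's implementation, and library versions may treat edge cases (non-ASCII letters, empty input) differently.

```python
# Consonant groups that share a Soundex digit; vowels, H, W and Y have none.
CODES = {}
for letters, digit in (('BFPV', '1'), ('CGJKQSXZ', '2'), ('DT', '3'),
                       ('L', '4'), ('MN', '5'), ('R', '6')):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ''
    encoded = letters[0]
    previous = CODES.get(letters[0], '')
    for c in letters[1:]:
        if c in 'HW':
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, '')
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + '000')[:4]


print(soundex('Robert'), soundex('Rupert'))
```

Two names that sound alike in English tend to get the same code, which makes the result usable as a lookup key.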
