字符串之间的缩写相似性

发布于 2025-02-10 19:24:21 字数 1439 浏览 1 评论 0原文

我的项目中有一个用例，我需要将键 string与许多字符串进行比较。如果此值大于某个阈值，我认为这些字符串与我的键< / code>且基于该列表“相似”，我进行了一些进一步的计算 /处理。

我一直在探索模糊匹配的字符串相似性内容，这些内容使用编辑距离基于“ levenshtein，jaro和jaro-winkler”等算法。

尽管它们正常工作，但如果一个字符串是另一个字符串的“缩写”，我希望获得更高的相似性得分。我可以使用任何算法/实现吗？

注意：

language: python3 
packages explored: fuzzywuzzy, jaro-winkler

示例：

using jaro_winkler similarity:

>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883

using levenshtein similarity:

>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30

在这种情况下，如果可能的话，我希望得分更高（＆gt; 90％）。我也可以有几个误报，因为它们不会在我的进一步计算中引起太多问题。但是，如果我们匹配S1和S2，以使S1完全包含在S2中（反之亦然），则它们的相似性得分应更高。

编辑：为我的用例的进一步示例

，空间是多余的。这意味着，wtw被认为是“ Willistowerwatson”和“ Willis Tower Watson”的缩写。

另外，stove是“堆栈溢出”或“ standartoverview”的有效缩写，

一个简单的算法是从较小字符串的第一个字符开始，看看它是否存在于较大的字符串中。然后检查第二个字符，依此类推，直到条件满足第一个字符串完全包含在第二字符串中。这对我来说是100％的匹配。

wtwx诸如“ willistowerwatson”之类的进一步示例可以给出80％的分数（这可以基于某些编辑距离逻辑）。即使我可以找到一个给出true或false的软件包，缩写相似性也将很有帮助。

原文

I have a use case in my project where I need to compare a key-string with a lot many strings for similarity. If this value is greater than a certain threshold, I consider those strings "similar" to my key and based on that list, I do some further calculations / processing.

I have been exploring fuzzy matching string similarity stuff, which use edit distance based algorithms like "levenshtein, jaro and jaro-winkler" similarities.

Although they work fine, I want to have a higher similarity score if one string is "abbreviation" of another. Is there any algorithm/ implementation I can use for this.

Note:

language: python3 
packages explored: fuzzywuzzy, jaro-winkler

Example:

using jaro_winkler similarity:

>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883

using levenshtein similarity:

>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30

In these kind of cases, I want score to be higher (>90%) if possible. I'm ok with few false positives as well, as they won't cause too much issue with my further calculations. But if we match s1 and s2 such that s1 is fully contained in s2 (or vice versa), their similarity score should be much higher.

Edit: Further Examples for my Use-Case

For me, spaces are redundant. That means, wtw is considered abbreviation for "willistowerwatson" and "willis tower watson" alike.

Also, stove is a valid abbreviation for "STack OVErflow" or "STandardOVErview"

A simple algo would be to start with 1st char of smaller string and see if it is present in the larger one. Then check for 2nd char and so on until the condition satisfies that 1st string is fully contained in 2nd string. This is a 100% match for me.

Further examples like wtwx to "willistowerwatson" could give a score of, say 80% (this can be based on some edit distance logic). Even if I can find a package which gives either True or False for abbreviation similarity would also be helpful.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

(り薆情海 2025-02-17 19:24:21

要检测字符串中的缩写，您仍然可以使用fuzzywuzzy使用Process（）函数：

from fuzzywuzzy import fuzz, process

s1 = ["willis tower watson", "stack overflow", "willistowerwatson", "international business machines"]
s2 = ['wtw', "so", "wtw", "ibz"]

queries = [''.join([i[0] for i in j.split()]) for j in s1]

for query, company in zip(queries, s1):
    print(company, '-', process.extractOne(query, s2, scorer=fuzz.partial_token_sort_ratio))

output：

willis tower watson - ('wtw', 100)
stack overflow - ('so', 100)
willistowerwatson - ('wtw', 100)
international business machines - ('ibz', 67)

To detect abbrevioations in string, you can still using fuzzywuzzy module with the process() function:

from fuzzywuzzy import fuzz, process

s1 = ["willis tower watson", "stack overflow", "willistowerwatson", "international business machines"]
s2 = ['wtw', "so", "wtw", "ibz"]

queries = [''.join([i[0] for i in j.split()]) for j in s1]

for query, company in zip(queries, s1):
    print(company, '-', process.extractOne(query, s2, scorer=fuzz.partial_token_sort_ratio))

Output:

willis tower watson - ('wtw', 100)
stack overflow - ('so', 100)
willistowerwatson - ('wtw', 100)
international business machines - ('ibz', 67)

回复收藏 0 原文

与风相奔跑 2025-02-17 19:24:21

您可以使用类似于序列比对的递归算法。只是不要对轮班的罚款（正如它们在缩写中所期望的那样），而是要对第一字符的不匹配。

例如：

def abbreviation(abr,word,penalty=1):
    if len(abr)==0:
        return 0
    elif len(word)==0:
        return penalty*len(abr)*-1
    elif abr[0] == word[0]:
        if len(abr)>1:
            return 1 + max(abbreviation(abr[1:],word[1:]),
                           abbreviation(abr[2:],word[1:])-penalty)
        else:
            return 1 + abbreviation(abr[1:],word[1:])
    else:
        return abbreviation(abr,word[1:])

def compute_match(abbr,word,penalty=1):
    score = abbreviation(abbr.lower(),
                         word.lower(),
                         penalty)
    if abbr[0].lower() != word[0].lower(): score-=penalty
    
    score = score/len(abbr)

    return score


print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))

输出是：

1.0
1.0
1.0
0.6666666666666666
0.5

指示wtw和wtwo是willistowerwatson wtwo ，stove是完全有效的缩写。是stackoverflow的有效缩写，但不是tov，它的第一个字符错误。
wtwx只是willistowerwatson beacuse的部分有效的缩写，其结尾是不在全名中出现的字符。

You can use a recursive algorithm, similar to sequence alignment. Just don't give penalty for shifts (as they are expected in abbreviations) but give one for mismatch in first characters.

This one should work, for example:

def abbreviation(abr,word,penalty=1):
    if len(abr)==0:
        return 0
    elif len(word)==0:
        return penalty*len(abr)*-1
    elif abr[0] == word[0]:
        if len(abr)>1:
            return 1 + max(abbreviation(abr[1:],word[1:]),
                           abbreviation(abr[2:],word[1:])-penalty)
        else:
            return 1 + abbreviation(abr[1:],word[1:])
    else:
        return abbreviation(abr,word[1:])

def compute_match(abbr,word,penalty=1):
    score = abbreviation(abbr.lower(),
                         word.lower(),
                         penalty)
    if abbr[0].lower() != word[0].lower(): score-=penalty
    
    score = score/len(abbr)

    return score


print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))

The output is:

1.0
1.0
1.0
0.6666666666666666
0.5

Indicating that wtw and wtwo are perfectly valid abbreviations for willistowerwatson, that stove is a valid abbreviation of Stackoverflow but not tov, which has the wrong first character.
And wtwx is only partially valid abbreviation for willistowerwatson beacuse it ends with a character that does not occur in the full name.