Abbreviation similarity between strings
I have a use case in my project where I need to compare a key string with many other strings for similarity. If the similarity score is greater than a certain threshold, I consider those strings "similar" to my key, and based on that list I do some further calculations and processing.
I have been exploring fuzzy string matching techniques that use edit-distance-based algorithms such as Levenshtein, Jaro and Jaro-Winkler.
Although they work fine, I want a higher similarity score when one string is an "abbreviation" of another. Is there any algorithm or implementation I can use for this?
Note:
language: python3
packages explored: fuzzywuzzy, jaro-winkler
Example:
using jaro_winkler similarity:
>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883
using levenshtein similarity:
>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30
In these kinds of cases, I want the score to be higher (>90%) if possible. I'm OK with a few false positives, as they won't cause much trouble in my further calculations. But if we match s1 and s2 such that s1 is fully contained in s2 (or vice versa), their similarity score should be much higher.
Edit: Further examples for my use case
For me, spaces are redundant. That means "wtw" is considered an abbreviation for "willistowerwatson" and "willis tower watson" alike.
Also, "stove" is a valid abbreviation for "STack OVErflow" or "STandardOVErview".
A simple algorithm would be to start with the first character of the shorter string and see if it is present in the longer one, then check for the second character, and so on, until the shorter string is found to be fully contained in the longer one. This is a 100% match for me.
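The check described above can be sketched as an in-order subsequence test. This is a minimal sketch under a few assumptions not stated in the question: the characters must appear in the same order, matching is case-insensitive (suggested by the "STack OVErflow" example), and the function name is_abbreviation is hypothetical.

```python
def is_abbreviation(s1: str, s2: str) -> bool:
    """Return True if the shorter string is an in-order subsequence of the longer."""
    # Assumption: spaces are redundant and case does not matter.
    a = s1.replace(" ", "").lower()
    b = s2.replace(" ", "").lower()
    short, long_ = sorted((a, b), key=len)
    remaining = iter(long_)
    # `ch in remaining` consumes the iterator up to the first match, so each
    # character is only searched for after the position of the previous one.
    return all(ch in remaining for ch in short)


print(is_abbreviation("wtw", "willis tower watson"))  # True
print(is_abbreviation("stove", "STack OVErflow"))     # True
print(is_abbreviation("wtwx", "willistowerwatson"))   # False
```

Note that this only answers yes or no, and it does not treat the first character specially, so "tov" would also pass against "stackoverflow".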
Further examples like "wtwx" against "willistowerwatson" could give a score of, say, 80% (this could be based on some edit distance logic). Even a package that returns just True or False for abbreviation similarity would be helpful.
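One way to get a graded score rather than True or False is to measure how much of the abbreviation can be matched in order. The sketch below assumes the score is the length of the longest common subsequence divided by the length of the abbreviation; that formula and the name abbreviation_ratio are illustrative choices, not something taken from an existing package.

```python
def abbreviation_ratio(abbr: str, full: str) -> float:
    """Fraction of `abbr` characters that can be matched, in order, inside `full`."""
    # Assumption: spaces are redundant and case does not matter.
    a = abbr.replace(" ", "").lower()
    w = full.replace(" ", "").lower()
    if not a:
        return 0.0
    # prev[j] holds the LCS length of the processed prefix of `a` and w[:j].
    prev = [0] * (len(w) + 1)
    for ch in a:
        cur = [0]
        for j, wc in enumerate(w, start=1):
            if ch == wc:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(a)


print(abbreviation_ratio("wtw", "willis tower watson"))  # 1.0
print(abbreviation_ratio("wtwx", "willistowerwatson"))   # 0.75
```

With this definition "wtwx" scores 75% rather than the 80% mentioned above; a different weighting would be needed to hit a specific target.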
Comments (2)
To detect abbreviations in a string, you can still use the fuzzywuzzy module with the process() function:
Output:
You can use a recursive algorithm, similar to sequence alignment. Just don't penalize shifts (they are expected in abbreviations), but do penalize a mismatch in the first characters.
This one should work, for example:
The output is:
This indicates that "wtw" and "wtwo" are perfectly valid abbreviations for "willistowerwatson", and that "stove" is a valid abbreviation of "Stackoverflow" but "tov" is not, since it has the wrong first character. And "wtwx" is only a partially valid abbreviation for "willistowerwatson" because it ends with a character that does not occur in the full name.
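A minimal sketch of that idea, written independently and not the answerer's own code: skipping characters of the full name is free, an abbreviation character that cannot be matched costs a penalty, and a mismatch between the two first characters costs an extra penalty. The function name abbreviation_score, the default penalty of 1, and the normalization by abbreviation length are all assumptions made for illustration.

```python
from functools import lru_cache


def abbreviation_score(abbr: str, full: str, penalty: float = 1.0) -> float:
    """Alignment-style score in [0, 1]; 1.0 means a perfectly valid abbreviation."""
    # Assumption: spaces are redundant and case does not matter.
    a = abbr.replace(" ", "").lower()
    w = full.replace(" ", "").lower()
    if not a or not w:
        return 0.0

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> float:
        # Best raw score for aligning a[i:] against w[j:].
        if i == len(a):
            return 0.0
        if j == len(w):
            # Abbreviation characters left over with nothing to match.
            return -penalty * (len(a) - i)
        skip_full = best(i, j + 1)            # shift in the full name: free
        drop_abbr = best(i + 1, j) - penalty  # unmatched abbreviation character
        if a[i] == w[j]:
            return max(1.0 + best(i + 1, j + 1), skip_full, drop_abbr)
        return max(skip_full, drop_abbr)

    raw = best(0, 0)
    if a[0] != w[0]:
        raw -= penalty  # wrong first character
    return max(0.0, raw / len(a))


for abbr, full in [
    ("wtw", "willistowerwatson"),
    ("wtwo", "willistowerwatson"),
    ("stove", "Stackoverflow"),
    ("tov", "Stackoverflow"),
    ("wtwx", "willistowerwatson"),
]:
    print(abbr, full, round(abbreviation_score(abbr, full), 2))
```

Under these assumptions the first three pairs score 1.0, "tov" scores about 0.67 because of the first-character penalty, and "wtwx" scores 0.5 because the trailing "x" is both unmatched and penalized.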