String similarity score / hash value

Posted 2024-10-05 18:53:11

Is there a method to calculate something like a general "similarity score" for a string? In a way that I am not comparing two strings together, but rather get some number (hash) for each string that can later tell me whether two strings are or are not similar. Two similar strings should have similar (close) hashes.

Let's consider these strings and scores as an example:

Hello world                1000
Hello world!               1010
Hello earth                1125
Foo bar                    3250
FooBarbar                  3750
Foo Bar!                   3300
Foo world!                 2350

You can see that "Hello world!" and "Hello world" are similar, and their scores are close to each other.

This way, finding the strings most similar to a given string could be done by subtracting the given string's score from the other scores and then sorting by the absolute values of the differences.
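
To make the intended lookup concrete, here is a minimal C# sketch of the subtract-and-sort step described above. The Score function is hypothetical: it stands for exactly the single-number hash being asked about, and the other names are mine.

using System;
using System.Collections.Generic;
using System.Linq;

static class SimilarityLookup
{
    // Hypothetical: some function mapping a string to a single similarity score.
    static long Score(string s) => throw new NotImplementedException();

    // Rank candidates by how close their score is to the query string's score.
    public static IEnumerable<string> MostSimilar(string query, IEnumerable<string> candidates)
    {
        long q = Score(query);
        return candidates.OrderBy(c => Math.Abs(Score(c) - q));
    }
}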

Comments (12)

不念旧人 2024-10-12 18:53:14

Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.

Just an idea.

ready-to-run PCA in C#
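
A minimal sketch of one way to read that idea, purely as an assumption on my part: each string becomes a fixed-width vector of differences between its characters and a fixed reference alphabet, and those vectors would then be fed into a PCA routine (for example the linked C# code). The DifferenceVector helper is illustrative, not part of the original suggestion.

using System;

static class PcaFeatures
{
    // Encode a string as a fixed-width vector of character differences against
    // a fixed reference alphabet; positions past the end of the string are zero.
    public static double[] DifferenceVector(string s)
    {
        const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // fixed reference
        var v = new double[alphabet.Length];
        for (int i = 0; i < alphabet.Length; i++)
            v[i] = i < s.Length ? s[i] - alphabet[i] : 0.0;
        return v;
    }
}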

又怨 2024-10-12 18:53:14

It is unlikely that you can derive from a phrase a fairly small number which, when compared with another such number, gives a relevant indication of how similar the original phrases are.
One reason is that a single number provides information in only one dimension, while phrases vary in two dimensions, length and intensity.

The number itself could vary in both length and intensity, but I'm not sure that would help much.

In two dimensions, you are better off looking at a matrix, where some properties, such as the determinant, could give a rough idea of the phrase's "trend".

︶葆Ⅱㄣ 2024-10-12 18:53:14

In natural language processing there is a concept called minimum edit distance (also known as Levenshtein distance).
It is defined as the smallest number of operations needed to transform string1 into string2.
The operations are insertion, deletion, and substitution; each operation has a cost that is added to the distance.
The idea for your problem is to calculate the MED from your chosen string to every other string, sort that collection, and pick the n strings with the smallest distances.
For example:

{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"  
Med(base-string, "Hello World!") = 1  
Med(base-string, "Hello Earth") = 8  
1st closest string is "Hello World!"

This effectively assigns a score to each string in your collection.
C# implementation (insertion = 1, deletion = 1, substitution = 2):

public static int Distance(string s1, string s2)
{
    // DP table: matrix[i, j] = cost of transforming the first i chars of s1
    // into the first j chars of s2.
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

    // Base cases: transforming to or from the empty string costs one
    // deletion or insertion per character.
    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int i = 0; i <= s2.Length; i++)
        matrix[0, i] = i;

    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int value1 = matrix[i - 1, j] + 1;      // deletion (cost 1)
            int value2 = matrix[i, j - 1] + 1;      // insertion (cost 1)
            int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2); // match, or substitution (cost 2)

            matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
        }
    }

    return matrix[s1.Length, s2.Length];
}

Complexity is O(n × m), where n and m are the lengths of the two strings.
More info on Minimum Edit Distance can be found here
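
A short usage sketch of the ranking step described in the text above, reusing the Distance method (the helper name and the candidate list are mine, purely for illustration):

using System.Linq;

// Rank candidates by edit distance to the base string and return the closest.
// Assumes the Distance method above is in scope.
public static string Closest(string baseString, string[] candidates)
{
    return candidates.OrderBy(c => Distance(baseString, c)).First();
}

// Closest("Hello World", new[] { "Hello World!", "Hello Earth" }) returns "Hello World!"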

蝶舞 2024-10-12 18:53:14

Well, you could add up the ASCII value of each character and then compare the scores, with a maximum value by which they are allowed to differ. This does not guarantee, however, that the strings are similar, for the same reason that two different strings can have the same hash value.

You could of course make a more complex function, starting by checking the size of the strings and then comparing each character one by one, again with a maximum difference set up.
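
A minimal sketch of the character-sum idea (illustrative only; as noted above, very different strings can collide — for example "ab" and "ba" get the same score):

// Sum of character codes as a crude score; use only as a rough pre-filter.
static int CharSumScore(string s)
{
    int sum = 0;
    foreach (char c in s)
        sum += c;
    return sum;
}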

七秒鱼° 2024-10-12 18:53:13

In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.

Imagine similarity at the character level

stops
spots

hello world
world hello

In both examples the messages are different, but the characters in the messages are identical, so the measure would need to hold a position value as well as a character value. (char 0 == 'h', char 1 == 'e' ...)

Then compare the following similar messages

hello world
ello world

Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.

In the case of

spots
stops

the words differ only in the positions of their characters, so some form of positional information is important.

If the following strings are similar

 yesssssssssssssss
 yessssssssssssss

Then you have a form of paradox. If you add 2 s characters to the second string, it ends up the same distance from the first string as before, yet it must be a distinct value. This can be repeated with progressively longer strings, all of which need to be close to the strings just shorter and just longer than themselves. I can't see how to achieve this.

In general this is treated as a multi-dimensional problem - breaking the string into a vector

[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]

But the values of the vector cannot be

  • represented by a fixed-size number, or
  • used to give a good-quality difference measure.

If the number of words or the length of the strings were bounded, then a coding solution might be possible.

Bounded values

Using something like arithmetic compression, a sequence of words can be converted into a floating-point number which represents the sequence. However, this would treat items earlier in the sequence as more significant than the last item in the sequence.

Data mining solution

If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (wikipedia: metric tree). This limits your search space, while not providing the "single number" solution you asked for.

I have code for this at github: clustering

Items that are close together should be stored in the same part of the tree, but there is really no guarantee. The radius of each subtree is used to prune the search space.

Edit Distance or Levenshtein distance

This is used in an sqlite extension to perform similarity searching. It is not a single-number solution: it works out how many edits change one string into another, and that edit count serves as a score showing similarity.
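
One concrete metric tree that is commonly paired with edit distance is a BK-tree. The sketch below is my own illustration of that structure (it is not the linked github code); it uses the triangle inequality to prune the search.

using System;
using System.Collections.Generic;

class BkNode
{
    public string Word;
    public Dictionary<int, BkNode> Children = new Dictionary<int, BkNode>();

    public BkNode(string word) { Word = word; }

    // Insert a word; the edge to a child is labelled with its distance to this node.
    public void Add(string word, Func<string, string, int> dist)
    {
        int d = dist(word, Word);
        if (Children.TryGetValue(d, out var child)) child.Add(word, dist);
        else Children[d] = new BkNode(word);
    }

    // Collect all stored words within 'radius' edits of 'query'.
    public void Search(string query, int radius, Func<string, string, int> dist, List<string> results)
    {
        int d = dist(query, Word);
        if (d <= radius) results.Add(Word);
        // By the triangle inequality, only children whose edge label lies in
        // [d - radius, d + radius] can contain matches.
        for (int i = Math.Max(0, d - radius); i <= d + radius; i++)
            if (Children.TryGetValue(i, out var child))
                child.Search(query, radius, dist, results);
    }
}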

生生不灭 2024-10-12 18:53:13

I'm thinking of something like this (see the sketch below):

  1. Remove all non-word characters
  2. Apply Soundex
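
A minimal sketch of those two steps (a simplified American Soundex, without the full H/W adjacency rules; the helper name is mine):

using System;
using System.Linq;
using System.Text;

static class Phonetic
{
    // Step 1: drop everything that is not an ASCII letter.
    // Step 2: apply (simplified) Soundex.
    public static string SimpleSoundex(string input)
    {
        string word = new string(input
            .Where(c => (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            .ToArray()).ToUpperInvariant();
        if (word.Length == 0) return "0000";

        const string codes = "01230120022455012623010202"; // digit class for A..Z
        var sb = new StringBuilder();
        sb.Append(word[0]);
        char prev = codes[word[0] - 'A'];
        foreach (char c in word.Skip(1))
        {
            char code = codes[c - 'A'];
            if (code != '0' && code != prev) sb.Append(code);
            prev = code;
        }
        return sb.ToString().PadRight(4, '0').Substring(0, 4);
    }
}

// SimpleSoundex("Hello world") == "H464", SimpleSoundex("Hello earth") == "H463"
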
海夕 2024-10-12 18:53:13

Your idea sounds like an ontology, but applied to whole phrases. The more similar two phrases are, the closer they are in the graph (assuming you're using weighted edges). And vice versa: dissimilar phrases are very far from each other.

Another approach is to use a Fourier transform to get a sort of 'index' for a given string (it won't be a single number, though). You may find a bit more in this paper.

And another idea, based on the Levenshtein distance: you may compare n-grams, which will give you a similarity index for two given phrases; the more similar they are, the closer the value is to 1. This can be used to calculate distances in the graph. I wrote a paper on this a few years ago; if you'd like, I can share it.

Anyway: although I don't know the exact solution, I'm also interested in what you'll come up with.
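
One concrete version of that n-gram comparison, as an illustration only: the Sørensen–Dice coefficient over character trigrams, which approaches 1 as the two phrases become more similar.

using System;
using System.Collections.Generic;
using System.Linq;

static class NGramSimilarityExample
{
    static HashSet<string> NGrams(string s, int n)
    {
        var set = new HashSet<string>();
        for (int i = 0; i + n <= s.Length; i++)
            set.Add(s.Substring(i, n));
        return set;
    }

    // Dice coefficient over character n-grams: 1.0 for identical n-gram sets.
    public static double Similarity(string a, string b, int n = 3)
    {
        var ga = NGrams(a, n);
        var gb = NGrams(b, n);
        if (ga.Count + gb.Count == 0) return 1.0;
        int shared = ga.Intersect(gb).Count();
        return 2.0 * shared / (ga.Count + gb.Count);
    }
}

// Similarity("Hello world", "Hello world!") ≈ 0.95; Similarity("Hello world", "Foo bar") == 0.0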

您的好友蓝忘机已上羡 2024-10-12 18:53:12

I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.

As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping down into a lower-dimensional one. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
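
One well-known locality-sensitive scheme for text is SimHash: similar inputs tend to produce hashes that differ in only a few bits, so the Hamming distance between two hashes approximates dissimilarity. The sketch below is a generic illustration over character trigrams, not a specific library API.

using System;

static class SimHashExample
{
    // 64-bit SimHash over character n-grams.
    public static ulong SimHash(string s, int n = 3)
    {
        var counts = new int[64];
        for (int i = 0; i + n <= s.Length; i++)
        {
            // Hash each n-gram with FNV-1a (64-bit) and vote on every bit position.
            ulong h = 14695981039346656037UL;
            for (int j = i; j < i + n; j++)
            {
                h ^= (ulong)s[j];
                h *= 1099511628211UL;
            }
            for (int bit = 0; bit < 64; bit++)
                counts[bit] += ((h >> bit) & 1UL) == 1UL ? 1 : -1;
        }
        ulong result = 0;
        for (int bit = 0; bit < 64; bit++)
            if (counts[bit] > 0) result |= 1UL << bit;
        return result;
    }
}

// Closeness is then the Hamming distance between two hashes, e.g.
// System.Numerics.BitOperations.PopCount(SimHash(a) ^ SimHash(b)).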

扶醉桌前 2024-10-12 18:53:12

Levenshtein distance or one of its derivatives is the algorithm you want.
Match the given string against each string in the dictionary.
(Here, if you need only a fixed number of the most similar strings, you may want to use a min-heap.)
If running Levenshtein distance against all strings in the dictionary is too expensive, first use some rough
algorithm that excludes clearly distant words from the list of candidates.
After that, run Levenshtein distance on the remaining candidates.


One way to remove distant words is to index n-grams.
Preprocess the dictionary by splitting each entry into a list of n-grams.
For example, with n = 3:

(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]

Next, create an index of n-grams:

" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]

When you need to find the strings most similar to a given string, split it into n-grams and select only those
dictionary entries that share at least one n-gram with it.
This reduces the number of candidates to a reasonable amount, and you can then run Levenshtein matching between the given string and each remaining candidate (see the sketch below).
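
A minimal sketch of that index and candidate-selection step (class and method names are mine; the edit-distance function is passed in, for example the Distance method from the earlier answer):

using System;
using System.Collections.Generic;
using System.Linq;

class NGramIndex
{
    private readonly int n;
    private readonly string[] words;
    private readonly Dictionary<string, List<int>> index = new Dictionary<string, List<int>>();

    // Preprocess the dictionary: map every n-gram to the entries containing it.
    public NGramIndex(string[] dictionary, int n = 3)
    {
        this.n = n;
        words = dictionary;
        for (int id = 0; id < words.Length; id++)
            foreach (var gram in Grams(words[id]))
            {
                if (!index.TryGetValue(gram, out var list))
                    index[gram] = list = new List<int>();
                list.Add(id);
            }
    }

    private IEnumerable<string> Grams(string s)
    {
        for (int i = 0; i + n <= s.Length; i++)
            yield return s.Substring(i, n);
    }

    // Entries sharing at least one n-gram with the query, closest first
    // according to the supplied edit-distance function.
    public IEnumerable<string> Candidates(string query, Func<string, string, int> distance)
    {
        var ids = new HashSet<int>();
        foreach (var gram in Grams(query))
            if (index.TryGetValue(gram, out var list))
                ids.UnionWith(list);
        return ids.Select(id => words[id]).OrderBy(w => distance(query, w));
    }
}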


If your strings are long enough, you can reduce the index size by using the min-hashing technique:
compute an ordinary hash for each n-gram and keep only the K smallest hashes; the others are thrown away.

P.S. This presentation seems like a good introduction to your problem.

疯到世界奔溃 2024-10-12 18:53:12

This isn't possible in general, because strings under edit distance form a metric space, but not one with a fixed dimension. That means you can't provide a mapping between strings and integers that preserves the distance measure between them.

For example, you cannot assign numbers to these three phrases:

  • one two
  • one six
  • two six

such that the numbers reflect the differences between all three phrases: each pair differs in exactly one word, so all three pairwise distances are equal, and three mutually equidistant points cannot be placed on a single number line.

手心的温暖 2024-10-12 18:53:12

While the idea seems extremely sweet... I've never heard of this.

I've read many, many techniques, theses, and scientific papers on the subject of spell correction / typo correction, and the fastest proposals revolve around an index and the Levenshtein distance.

There are fairly elaborate techniques; the one I am currently working on combines:

  • A burst trie, with level compactness
  • A Levenshtein automaton

Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent research on string comparison if such a "scoring" method had proved efficient.

If you ever find such a method, I am extremely interested :)

来世叙缘 2024-10-12 18:53:12

Would Levenshtein distance work for you?
