What is a good metric for determining whether two strings are "similar enough"?

Posted on 2024-12-20 09:00:40

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think it is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assigning it, so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.


So far, my algorithm (pseudo code) is roughly along the lines of:

/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
boolean calculateSimilarity(List<String> list1, List<String> list2) {

    int length1 = 0;
    int length2 = 0;
    int levenshteinDistance = 0;

    Iterator<String> iterator1 = list1.iterator();
    Iterator<String> iterator2 = list2.iterator();

    while (iterator1.hasNext() && iterator2.hasNext()) {

        String str1 = iterator1.next();
        String str2 = iterator2.next();

        // skip lines that are blank/empty on both sides because they are not interesting
        if (str1.trim().isEmpty() && str2.trim().isEmpty()) {
            continue;
        }

        length1 += str1.length();
        length2 += str2.length();

        levenshteinDistance += getLevenshteinDistance(str1, str2);
    }

    // TODO: handle the rest of the lines from the iterator that has not terminated

    // cast to double, otherwise this is integer division and usually 0
    double difference = (double) levenshteinDistance / Math.max(length1, length2);

    return difference < 0.25; // <- arbitrary threshold, yuck!
}
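
For completeness, the getLevenshteinDistance helper called above is not shown in the question (the name matches the helper in Apache Commons Lang's StringUtils, which may be what it refers to). A self-contained sketch of the standard two-row dynamic-programming version, purely as my own illustration, could look like this:

// A minimal sketch of the Levenshtein distance helper used above (two-row DP).
// This is an assumption about what getLevenshteinDistance does, not the asker's code.
static int getLevenshteinDistance(String s, String t) {
    int[] prev = new int[t.length() + 1];
    int[] curr = new int[t.length() + 1];

    // Distance from the empty prefix of s to each prefix of t is just that prefix's length.
    for (int j = 0; j <= t.length(); j++) {
        prev[j] = j;
    }

    for (int i = 1; i <= s.length(); i++) {
        curr[0] = i; // deleting i characters from s
        for (int j = 1; j <= t.length(); j++) {
            int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[t.length()];
}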


Comments (4)

为你拒绝所有暧昧 2024-12-27 09:00:40

How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

Take all the letters from both Strings and build a table like this:

Letter | String1 | String2

This can be a simple hash table or whatever.

In the letter column, put each letter; in the string columns, put its frequency within that string (if a letter does not appear in a string, the value is 0).

It is called cosine similarity because you interpret each of the two string columns as a vector, where each component is the number associated with a letter. Next, compute the cosine of the "angle" between the vectors as:

C = (V1 * V2) / (|V1| * |V2|)

The numerator is the dot product, that is, the sum of the products of the corresponding components, and the denominator is the product of the magnitudes of the vectors.

How close C is to 1 gives you how similar the Strings are.

It may seem complicated but it's just a few lines of code once you understand the idea.

Let's see an example: consider the strings

s1 = aabccdd
s2 = ababcd

The table looks like:

Letter a b c d
s1     2 1 2 2
s2     2 2 1 1

And thus:

C = (V1 * V2) / (|V1| * |V2|) = 
(2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877

So they are "pretty" similar.
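
A minimal Java sketch of this letter-frequency cosine similarity (method and variable names are illustrative, not from the answer):

import java.util.HashMap;
import java.util.Map;

// Cosine similarity over per-character frequency vectors, as described above.
static double cosineSimilarity(String s1, String s2) {
    Map<Character, int[]> counts = new HashMap<>();
    for (char c : s1.toCharArray()) {
        counts.computeIfAbsent(c, k -> new int[2])[0]++;
    }
    for (char c : s2.toCharArray()) {
        counts.computeIfAbsent(c, k -> new int[2])[1]++;
    }

    long dot = 0, norm1 = 0, norm2 = 0;
    for (int[] v : counts.values()) {
        dot   += (long) v[0] * v[1];
        norm1 += (long) v[0] * v[0];
        norm2 += (long) v[1] * v[1];
    }
    if (norm1 == 0 || norm2 == 0) {
        return 0.0; // at least one string contributes no letters
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

// cosineSimilarity("aabccdd", "ababcd") ≈ 0.877, matching the worked example.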

不醒的梦 2024-12-27 09:00:40

Stack traces are in a format amenable to parsing. I would just parse the stack traces using a parsing library and then you can extract whatever semantic content you want to compare.

Similarity algorithms are going to be slower and more difficult to debug when strings don't compare the way you expect them to.
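
As a rough illustration of that idea, without committing to any particular parsing library, one could pull out just the exception line and the frame identifiers and group on those instead of on raw text (helper name and behavior are my own assumptions; a real parser would also handle "Caused by:" chains and suppressed exceptions):

import java.util.ArrayList;
import java.util.List;

// Rough illustration of pulling comparable pieces out of a stack trace string:
// the first non-blank line (exception type + message) and the "at ..." frames,
// with file names and line numbers dropped.
static List<String> extractFrames(String stackTrace) {
    List<String> keyParts = new ArrayList<>();
    for (String line : stackTrace.split("\\r?\\n")) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            continue;
        }
        if (trimmed.startsWith("at ")) {
            int paren = trimmed.indexOf('(');
            keyParts.add(paren > 0 ? trimmed.substring(3, paren) : trimmed.substring(3));
        } else if (keyParts.isEmpty()) {
            keyParts.add(trimmed); // exception type + message
        }
    }
    return keyParts;
}

Two traces could then be grouped by, for example, their exception type plus the top few frames, rather than by raw edit distance over the whole text.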

筱果果 2024-12-27 09:00:40

Here's my take on this - just a long story to consider and not necessarily an answer to your problem:

I've done something similar in the past where I would try to determine if someone was plagiarizing by simply rearranging sentences while maintaining the same sort of message.

1 "children should play while we eat dinner"
2 "while we eat dinner, the children should play"
3 "we should eat children while we play"

So Levenshtein wouldn't be of much use here because it is linear and each one would be considerably different. The standard difference would pass the test and the student would get away with the crime.

So I broke the sentences up into words and recomposed them as arrays, then compared the arrays to each other: first to determine whether each word existed in both arrays and where it sat relative to the previous one, then to check each word against the next one in the array to see whether there were runs of sequential words, as in lines 1 and 2 of my example sentences above.
If there were sequential words, I would compose a string of each sequence common to both arrays and then attempt to find the differences in the remaining words. The fewer words remaining, the more likely they are just filler meant to make the text seem less plagiarized.

"while we eat dinner, I think the children should play"

Then "I think" is evaluated and considered filler based on a keyword lexicon - this part is hard to describe here.

This was a complex project that did a lot more than just what I described and not a simple chunk of code I can easily share, but the idea above is not too hard to replicate.

Good luck. I'm interested in what other SO members have to say about your question.
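
A tiny sketch of the "sequential words" idea, reduced here to word-level bigram overlap (a deliberate simplification of the project described above, with names of my own choosing):

import java.util.HashSet;
import java.util.Set;

// Word-level bigram overlap: a crude stand-in for the "sequential words" check.
// Higher overlap means more shared word order, not just shared vocabulary.
static double bigramOverlap(String a, String b) {
    Set<String> bigramsA = wordBigrams(a);
    Set<String> bigramsB = wordBigrams(b);
    if (bigramsA.isEmpty() || bigramsB.isEmpty()) {
        return 0.0;
    }
    Set<String> common = new HashSet<>(bigramsA);
    common.retainAll(bigramsB);
    return (double) common.size() / Math.min(bigramsA.size(), bigramsB.size());
}

static Set<String> wordBigrams(String text) {
    String[] words = text.trim().toLowerCase().split("\\s+");
    Set<String> bigrams = new HashSet<>();
    for (int i = 0; i + 1 < words.length; i++) {
        bigrams.add(words[i] + " " + words[i + 1]);
    }
    return bigrams;
}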

败给现实 2024-12-27 09:00:40

Since the Levenshtein distance is never greater than the length of the longer string, I'd certainly change the denominator from (length1 + length2) to Math.max(length1, length2). This would normalize the metric to be between zero and one.

Now, it's impossible to answer what's "similar enough" for your needs based on the information provided. I personally try to avoid step functions like you have with the 0.25 cutoff, preferring continuous values from a known interval. Perhaps it would be better to feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones?
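
As a sketch of that suggestion, the question's method could return the continuous value in [0, 1] and leave any cutoff to a higher-level grouping step (assuming the same getLevenshteinDistance helper as above):

// Return a continuous similarity in [0, 1] instead of a boolean, so a
// higher-level grouping/clustering step can choose its own cutoff (or none).
static double normalizedSimilarity(String s1, String s2) {
    int longer = Math.max(s1.length(), s2.length());
    if (longer == 0) {
        return 1.0; // two empty strings are identical
    }
    return 1.0 - (double) getLevenshteinDistance(s1, s2) / longer;
}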
