What is a good metric for determining whether two strings are "similar enough"?

Posted on 2024-12-20 09:00:40

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think it is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assigning it, so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.


So far, my algorithm (pseudo code) is roughly along the lines of:

/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
boolean calculateSimilarity(List<String> list1, List<String> list2) {

    int length1 = 0;
    int length2 = 0;
    int levenshteinDistance = 0;

    Iterator<String> iterator1 = list1.iterator();
    Iterator<String> iterator2 = list2.iterator();

    while (iterator1.hasNext() && iterator2.hasNext()) {

        String str1 = iterator1.next();
        String str2 = iterator2.next();

        // skip lines that are blank/empty on both sides because they are not interesting
        if (str1.trim().isEmpty() && str2.trim().isEmpty()) {
            continue;
        }

        length1 += str1.length();
        length2 += str2.length();

        levenshteinDistance += getLevenshteinDistance(str1, str2);
    }

    // TODO: handle the rest of the lines from the iterator that has not terminated

    // cast to double, otherwise this is integer division and usually 0
    double difference = (double) levenshteinDistance / Math.max(length1, length2);

    return difference < 0.25; // <- arbitrary threshold, yuck!
}
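
For completeness, the getLevenshteinDistance helper called above is not shown in the question (the name matches the helper in Apache Commons Lang's StringUtils, which may be what it refers to). A self-contained sketch of the standard two-row dynamic-programming version, purely as my own illustration, could look like this:

// A minimal sketch of the Levenshtein distance helper used above (two-row DP).
// This is an assumption about what getLevenshteinDistance does, not the asker's code.
static int getLevenshteinDistance(String s, String t) {
    int[] prev = new int[t.length() + 1];
    int[] curr = new int[t.length() + 1];

    // Distance from the empty prefix of s to each prefix of t is just that prefix's length.
    for (int j = 0; j <= t.length(); j++) {
        prev[j] = j;
    }

    for (int i = 1; i <= s.length(); i++) {
        curr[0] = i; // deleting i characters from s
        for (int j = 1; j <= t.length(); j++) {
            int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[t.length()];
}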


Comments (4)

为你拒绝所有暧昧 2024-12-27 09:00:40

How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

Take all the letters from both Strings and build a table like this:

Letter | String1 | String2

This can be a simple hash table or whatever.

In the letter column, put each letter; in the string columns, put its frequency within that string (if a letter does not appear in a string, the value is 0).

It is called cosine similarity because you interpret each of the two string columns as a vector, where each component is the number associated with a letter. Next, compute the cosine of the "angle" between the vectors as:

C = (V1 * V2) / (|V1| * |V2|)

The numerator is the dot product, that is, the sum of the products of the corresponding components, and the denominator is the product of the magnitudes of the vectors.

How close C is to 1 gives you how similar the Strings are.

It may seem complicated but it's just a few lines of code once you understand the idea.

Let's see an example: consider the strings

s1 = aabccdd
s2 = ababcd

The table looks like:

Letter a b c d
s1     2 1 2 2
s2     2 2 1 1

And thus:

C = (V1 * V2) / (|V1| * |V2|) = 
(2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877

So they are "pretty" similar.
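
A minimal Java sketch of this letter-frequency cosine similarity (method and variable names are illustrative, not from the answer):

import java.util.HashMap;
import java.util.Map;

// Cosine similarity over per-character frequency vectors, as described above.
static double cosineSimilarity(String s1, String s2) {
    Map<Character, int[]> counts = new HashMap<>();
    for (char c : s1.toCharArray()) {
        counts.computeIfAbsent(c, k -> new int[2])[0]++;
    }
    for (char c : s2.toCharArray()) {
        counts.computeIfAbsent(c, k -> new int[2])[1]++;
    }

    long dot = 0, norm1 = 0, norm2 = 0;
    for (int[] v : counts.values()) {
        dot   += (long) v[0] * v[1];
        norm1 += (long) v[0] * v[0];
        norm2 += (long) v[1] * v[1];
    }
    if (norm1 == 0 || norm2 == 0) {
        return 0.0; // at least one string contributes no letters
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

// cosineSimilarity("aabccdd", "ababcd") ≈ 0.877, matching the worked example.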

不醒的梦 2024-12-27 09:00:40

Stack traces are in a format amenable to parsing. I would just parse the stack traces using a parsing library and then you can extract whatever semantic content you want to compare.

Similarity algorithms are going to be slower and more difficult to debug when strings don't compare the way you expect them to.
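
As a rough illustration of that idea, without committing to any particular parsing library, one could pull out just the exception line and the frame identifiers and group on those instead of on raw text (helper name and behavior are my own assumptions; a real parser would also handle "Caused by:" chains and suppressed exceptions):

import java.util.ArrayList;
import java.util.List;

// Rough illustration of pulling comparable pieces out of a stack trace string:
// the first non-blank line (exception type + message) and the "at ..." frames,
// with file names and line numbers dropped.
static List<String> extractFrames(String stackTrace) {
    List<String> keyParts = new ArrayList<>();
    for (String line : stackTrace.split("\\r?\\n")) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            continue;
        }
        if (trimmed.startsWith("at ")) {
            int paren = trimmed.indexOf('(');
            keyParts.add(paren > 0 ? trimmed.substring(3, paren) : trimmed.substring(3));
        } else if (keyParts.isEmpty()) {
            keyParts.add(trimmed); // exception type + message
        }
    }
    return keyParts;
}

Two traces could then be grouped by, for example, their exception type plus the top few frames, rather than by raw edit distance over the whole text.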

筱果果 2024-12-27 09:00:40

Here's my take on this - just a long story to consider and not necessarily an answer to your problem:

I've done something similar in the past where I would try to determine if someone was plagiarizing by simply rearranging sentences while maintaining the same sort of message.

1 "children should play while we eat dinner"
2 "while we eat dinner, the children should play"
3 "we should eat children while we play"

So Levenshtein wouldn't be of much use here because it is linear and each one would be considerably different. The standard difference would pass the test and the student would get away with the crime.

So I broke the sentences up into words and recomposed them as arrays, then compared the arrays to each other: first to determine whether each word existed in both arrays and where it sat relative to the previous one, then to check each word against the next one in the array to see whether there were runs of sequential words, as in lines 1 and 2 of my example sentences above.
If there were sequential words, I would compose a string of each sequence common to both arrays and then attempt to find the differences in the remaining words. The fewer words remaining, the more likely they are just filler meant to make the text seem less plagiarized.

"while we eat dinner, I think the children should play"

Then "I think" is evaluated and considered filler based on a keyword lexicon - this part is hard to describe here.

This was a complex project that did a lot more than just what I described and not a simple chunk of code I can easily share, but the idea above is not too hard to replicate.

Good luck. I'm interested in what other SO members have to say about your question.
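
A tiny sketch of the "sequential words" idea, reduced here to word-level bigram overlap (a deliberate simplification of the project described above, with names of my own choosing):

import java.util.HashSet;
import java.util.Set;

// Word-level bigram overlap: a crude stand-in for the "sequential words" check.
// Higher overlap means more shared word order, not just shared vocabulary.
static double bigramOverlap(String a, String b) {
    Set<String> bigramsA = wordBigrams(a);
    Set<String> bigramsB = wordBigrams(b);
    if (bigramsA.isEmpty() || bigramsB.isEmpty()) {
        return 0.0;
    }
    Set<String> common = new HashSet<>(bigramsA);
    common.retainAll(bigramsB);
    return (double) common.size() / Math.min(bigramsA.size(), bigramsB.size());
}

static Set<String> wordBigrams(String text) {
    String[] words = text.trim().toLowerCase().split("\\s+");
    Set<String> bigrams = new HashSet<>();
    for (int i = 0; i + 1 < words.length; i++) {
        bigrams.add(words[i] + " " + words[i + 1]);
    }
    return bigrams;
}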

败给现实 2024-12-27 09:00:40

Since the Levenshtein distance is never greater than the length of the longer string, I'd certainly change the denominator from (length1 + length2) to Math.max(length1, length2). This would normalize the metric to be between zero and one.

Now, it's impossible to answer what's "similar enough" for your needs based on the information provided. I personally try to avoid step functions like you have with the 0.25 cutoff, preferring continuous values from a known interval. Perhaps it would be better to feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones?
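
As a sketch of that suggestion, the question's method could return the continuous value in [0, 1] and leave any cutoff to a higher-level grouping step (assuming the same getLevenshteinDistance helper as above):

// Return a continuous similarity in [0, 1] instead of a boolean, so a
// higher-level grouping/clustering step can choose its own cutoff (or none).
static double normalizedSimilarity(String s1, String s2) {
    int longer = Math.max(s1.length(), s2.length());
    if (longer == 0) {
        return 1.0; // two empty strings are identical
    }
    return 1.0 - (double) getLevenshteinDistance(s1, s2) / longer;
}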
