Looking for a text-diff algorithm to detect and group similar lines
I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds that a set of lines is mismatched on both sides (i.e. in both files), it shall not only highlight those lines but also highlight the individual changes within these lines (I call this inter-line comparison here).
An example of my somewhat working solution:
(Screenshot: http://files.tempel.org/tmp/diff_example.png)
What it currently does is take a set of mismatched lines and run their individual characters through the diff algorithm once more, producing the pink highlighting.
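That character-level second pass can be sketched roughly like this (a hypothetical Python sketch using the standard difflib module as a stand-in for the actual diff implementation; the function name is illustrative):

```python
import difflib

def char_diff(left: str, right: str):
    """Return (tag, left_span, right_span) opcodes describing character-level
    changes between two mismatched lines, suitable for highlighting."""
    sm = difflib.SequenceMatcher(None, left, right)
    return [(tag, left[i1:i2], right[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

# The non-"equal" spans ("replace"/"insert"/"delete") are what would be
# highlighted in pink.
for tag, l, r in char_diff("original line 1", "original line one"):
    print(tag, repr(l), repr(r))
```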
However, the second set of mismatches, containing "original 2", requires more work: here, the first two lines on the right ("added line a/b") were added, while the third line is an altered version of the left side. I want my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algorithm such as Levenshtein distance, I could find that, of all the right-hand lines in the set of lines 3 to 5, line 5 matches left line 3 best. Thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
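The matching step just described could be sketched as follows (my own illustrative helpers, not the tool's actual code; the sample strings are made up to mirror the screenshot):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(left_line: str, right_lines: list) -> int:
    # Index of the right-hand line closest to left_line by edit distance.
    return min(range(len(right_lines)),
               key=lambda j: levenshtein(left_line, right_lines[j]))

rights = ["added line a", "added line b", "original 2 changed"]
print(best_match("original 2", rights))  # → 2: the altered version wins
```

Any right-hand line that ends up as nobody's best match would then be classified as an added line.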
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best-matching line on the right, but vice versa as well, and so on with all the other lines. Basically, I have to match every line on the left against every line on the right. At worst, this might even create crossings, so that it's no longer easily clear which lines were newly inserted and which were just altered (note: I do not want to deal with possibly moved lines in such a block, unless that would actually simplify the algorithm).
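One way to sketch that all-pairs matching without allowing crossings (an illustrative greedy approach of my own, not a known-optimal algorithm): score every left/right pair, accept the best pairs first, and reject any pair that would cross one already accepted. Unpaired lines count as deletions/insertions.

```python
import difflib

def pair_lines(lefts, rights, threshold=0.5):
    """Greedy, order-preserving pairing sketch. Pairs below `threshold`
    similarity are never accepted, so unrelated lines stay unpaired."""
    scores = sorted(
        ((difflib.SequenceMatcher(None, l, r).ratio(), i, j)
         for i, l in enumerate(lefts)
         for j, r in enumerate(rights)),
        reverse=True)
    pairs, used_l, used_r = [], set(), set()
    for score, i, j in scores:
        if score < threshold or i in used_l or j in used_r:
            continue
        # Reject pairs that would cross an accepted pair (no moved lines).
        if any((i < pi) != (j < pj) for pi, pj in pairs):
            continue
        pairs.append((i, j))
        used_l.add(i); used_r.add(j)
    return sorted(pairs)

lefts = ["int x = 1;", "foo();"]
rights = ["// new", "int x = 2;", "foo();"]
print(pair_lines(lefts, rights))  # → [(0, 1), (1, 2)]; "// new" is an insertion
```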
Sure, this is never going to be perfect, but I'm trying to make it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algorithms) are appreciated.
Update
I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code, I find one function, equal(), that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio to decide whether a similar line is still considered equal or not - I'll need something that self-adapts to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
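One illustrative possibility (my own sketch, not a known-good recipe) for a threshold that adapts to the block rather than being fixed: derive it from the similarity scores of all pairs in the block, e.g. halfway between the mean pair similarity and the best pair similarity, so "equal" means "clearly better than the typical pair in this block".

```python
import difflib

def similarity(a: str, b: str) -> float:
    # 0..1 similarity ratio between two lines.
    return difflib.SequenceMatcher(None, a, b).ratio()

def adaptive_equal(lefts, rights):
    """Build an equal(l, r) predicate whose threshold adapts to the block:
    the midpoint between the block's mean and best pair similarity."""
    scores = [similarity(l, r) for l in lefts for r in rights]
    threshold = (sum(scores) / len(scores) + max(scores)) / 2
    return lambda l, r: similarity(l, r) >= threshold
```

Such a predicate could then be plugged in where the boolean equal() is called today, without changing the surrounding LCS machinery.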
3 Answers
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce it. An example alignment of AACTTGCCA and AATGCGAT:

AACTTGCCA-
AA-T-GCGAT

We can think of the top string as the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a gap character "-" on the bottom is a deletion, each column with a "-" on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.

Here is another alignment of the same strings, with the same Levenshtein distance:

AACTTGCCA-
AA--TGCGAT
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the number of terrible-looking alignments like the example you gave.
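A score-only sketch of the affine-gap recurrences (my own illustrative Python after Gotoh's idea, not the paper's notation; the match/mismatch/gap weights are arbitrary and would need the empirical tuning mentioned above):

```python
NEG = float("-inf")

def gotoh_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gap penalties. Opening a gap costs
    gap_open; widening it by one more column costs only gap_extend, so one
    wide gap region scores better than several small ones."""
    n, m = len(a), len(b)
    # M: alignment ending in a match/mismatch column,
    # X: ending in a gap in b (deletion), Y: ending in a gap in a (insertion).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# "AAB" vs "AB": two matches plus one single-column gap.
print(gotoh_score("AAB", "AB"))  # → 1  (2 + 2 - 3)
```

Recovering the alignment itself would additionally require a traceback through the three tables, which is omitted here.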
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
After you have determined the mismatched chunks, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification: when you used the algorithm to match equal lines, a pair of lines could either match or not match, so the comparison added either 0 or 1 to the cell of the table you used.
When comparing strings within one chunk, some of them are "more equal" than others (ack. to Orwell). So a pair of lines can add a real number between 0 and 1 to the cell when considering which sequence matches best so far.
To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this when you were doing the first pass of the Levenshtein algorithm). This will compute the length of the LCS, whose ratio to the average length of the two strings would be the metric value.
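That metric can be sketched directly (an illustrative Python version of the ratio just described; names are my own):

```python
def lcs_length(a: str, b: str) -> int:
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def line_similarity(a: str, b: str) -> float:
    """The 0..1 metric described above: LCS length divided by the average
    of the two line lengths (identical lines score exactly 1.0)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)

print(line_similarity("original 2", "original 2 changed"))  # ≈ 0.714
```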
Or, you can borrow the algorithm from one of the diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was the per-line matching described above: find the best-matching pairs of lines within a mismatched block, then run the inter-line comparison on each pair.
While this would allow for a better visual display of changes when comparing source code revisions, I have now found that a much simpler approach is usually sufficient: treat the entire mismatched block on each side as one string and run the character-level diff over it as a whole.
Maybe those who replied to my original question assumed that I knew to do this all the time, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply LCS on the set of lines by concatenating them, instead of processing them line-by-line.
So, while this approach will not provide change information as detailed as my original intent, it still improves the results over what I started with yesterday when I wrote this question.
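The concatenation idea above can be sketched like this (again with difflib standing in for the actual diff code; names are illustrative):

```python
import difflib

def block_diff(left_lines, right_lines):
    """Concatenate each side's mismatched block into one string (keeping
    newlines) and run a single character-level comparison over the whole
    block, returning only the changed spans."""
    left = "\n".join(left_lines)
    right = "\n".join(right_lines)
    sm = difflib.SequenceMatcher(None, left, right)
    return [(tag, left[i1:i2], right[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

changes = block_diff(["original 2"],
                     ["added line a", "added line b", "original 2 changed"])
print(changes)
```

Because the newlines are part of the strings, inserted lines simply show up as inserted spans that contain "\n", with no per-line pairing step needed.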
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).