Looking for a text-diff algorithm to detect and group similar lines
I am in the process of writing a diff text tool to compare two similar source code files.
There are many such "diff" tools around, but mine shall be a little improved:
If it finds that a set of lines is mismatched on both sides (i.e. in both files), it shall not only highlight those lines but also highlight the individual changes within these lines (I call this inter-line comparison here).
An example of my somewhat working solution:
(Screenshot: http://files.tempel.org/tmp/diff_example.png)
What it currently does is take a set of mismatched lines and run their individual characters through the diff algorithm once more, producing the pink highlighting.
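That character-level second pass can be sketched roughly like this (a hypothetical Python sketch using the standard difflib module as a stand-in for the actual diff implementation; the function name is illustrative):

```python
import difflib

def char_diff(left: str, right: str):
    """Return (tag, left_span, right_span) opcodes describing character-level
    changes between two mismatched lines, suitable for highlighting."""
    sm = difflib.SequenceMatcher(None, left, right)
    return [(tag, left[i1:i2], right[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()]

# The non-"equal" spans ("replace"/"insert"/"delete") are what would be
# highlighted in pink.
for tag, l, r in char_diff("original line 1", "original line one"):
    print(tag, repr(l), repr(r))
```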
However, the second set of mismatches, containing "original 2", requires more work: here, the first two lines on the right ("added line a/b") were added, while the third line is an altered version of the left side. I want my software to detect this difference between a likely alteration and a probable new line.
When looking at this simple example, I can rather easily detect this case:
With an algorithm such as Levenshtein distance, I could find that, of all the right-hand lines in the set of lines 3 to 5, line 5 matches left line 3 best. Thus I could deduce that lines 3 and 4 on the right were added, and perform the inter-line comparison on left line 3 and right line 5.
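The matching step just described could be sketched as follows (my own illustrative helpers, not the tool's actual code; the sample strings are made up to mirror the screenshot):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(left_line: str, right_lines: list) -> int:
    # Index of the right-hand line closest to left_line by edit distance.
    return min(range(len(right_lines)),
               key=lambda j: levenshtein(left_line, right_lines[j]))

rights = ["added line a", "added line b", "original 2 changed"]
print(best_match("original 2", rights))  # → 2: the altered version wins
```

Any right-hand line that ends up as nobody's best match would then be classified as an added line.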
So far, so good. But I am still stuck with how to turn this into a more general algorithm for this purpose.
In a more complex situation, a set of different lines could have added lines on both sides, with a few closely matching lines in between. This gets quite complicated:
I'd have to match not only the first line on the left to the best-matching line on the right, but vice versa as well, and so on with all the other lines. Basically, I have to match every line on the left against every line on the right. At worst, this might even create crossings, so that it's no longer easily clear which lines were newly inserted and which were just altered (note: I do not want to deal with possibly moved lines in such a block, unless that would actually simplify the algorithm).
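One way to sketch that all-pairs matching without allowing crossings (an illustrative greedy approach of my own, not a known-optimal algorithm): score every left/right pair, accept the best pairs first, and reject any pair that would cross one already accepted. Unpaired lines count as deletions/insertions.

```python
import difflib

def pair_lines(lefts, rights, threshold=0.5):
    """Greedy, order-preserving pairing sketch. Pairs below `threshold`
    similarity are never accepted, so unrelated lines stay unpaired."""
    scores = sorted(
        ((difflib.SequenceMatcher(None, l, r).ratio(), i, j)
         for i, l in enumerate(lefts)
         for j, r in enumerate(rights)),
        reverse=True)
    pairs, used_l, used_r = [], set(), set()
    for score, i, j in scores:
        if score < threshold or i in used_l or j in used_r:
            continue
        # Reject pairs that would cross an accepted pair (no moved lines).
        if any((i < pi) != (j < pj) for pi, pj in pairs):
            continue
        pairs.append((i, j))
        used_l.add(i); used_r.add(j)
    return sorted(pairs)

lefts = ["int x = 1;", "foo();"]
rights = ["// new", "int x = 2;", "foo();"]
print(pair_lines(lefts, rights))  # → [(0, 1), (1, 2)]; "// new" is an insertion
```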
Sure, this is never going to be perfect, but I'm trying to make it better than it is now. Any suggestions that aren't too theoretical but rather practical (I'm not good at understanding abstract algorithms) are appreciated.
Update
I must admit that I do not even understand how the LCS algo works. I simply feed it two arrays of strings and out comes a list of which sequences do not match. I am basically using the code from here: http://www.incava.org/projects/java/java-diff
Looking at the code, I find one function, equal(), that is responsible for telling the algorithm whether two lines match or not. Based on what Pavel suggested, I wonder if that's the place where I'd make the changes. But how? This function only returns a boolean - not a relative value that could identify the quality of the match. And I cannot simply use a fixed Levenshtein ratio to decide whether a similar line is still considered equal or not - I'll need something that self-adapts to the entire set of lines in question.
So, what I'm basically saying is that I still do not understand where I'd apply the fuzzy value that relates to the relative similarity of lines that do not (exactly) match.
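One illustrative possibility (my own sketch, not a known-good recipe) for a threshold that adapts to the block rather than being fixed: derive it from the similarity scores of all pairs in the block, e.g. halfway between the mean pair similarity and the best pair similarity, so "equal" means "clearly better than the typical pair in this block".

```python
import difflib

def similarity(a: str, b: str) -> float:
    # 0..1 similarity ratio between two lines.
    return difflib.SequenceMatcher(None, a, b).ratio()

def adaptive_equal(lefts, rights):
    """Build an equal(l, r) predicate whose threshold adapts to the block:
    the midpoint between the block's mean and best pair similarity."""
    scores = [similarity(l, r) for l in lefts for r in rights]
    threshold = (sum(scores) / len(scores) + max(scores)) / 2
    return lambda l, r: similarity(l, r) >= threshold
```

Such a predicate could then be plugged in where the boolean equal() is called today, without changing the surrounding LCS machinery.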
3 Answers
Levenshtein distance is based on the notion of an "edit script" that transforms one string into another. It's very closely related to the Needleman-Wunsch algorithm used for aligning DNA sequences by inserting gap characters, in which we search for the alignment that maximises a score in O(nm) time using dynamic programming. Exact matches between characters increase the score, while mismatches or inserted gap characters reduce it. An example alignment of AACTTGCCA and AATGCGAT:

AACTTGCCA-
AA-T-GCGAT

We can think of the top string as the "starting" sequence that we are transforming into the "final" sequence on the bottom. Each column containing a gap character "-" on the bottom is a deletion, each column with a "-" on the top is an insertion, and each column with different (non-gap) characters is a substitution. There are 2 deletions, 1 insertion and 1 substitution in the above alignment, so the Levenshtein distance is 4.

Here is another alignment of the same strings, with the same Levenshtein distance:

AACTTGCCA-
AA--TGCGAT
But notice that although there are the same number of gaps, there is one less gap region. Because biological processes are more likely to create wide gaps than multiple separate gaps, biologists prefer this alignment -- and so will the users of your program. This is accomplished by also penalising the number of gap regions in the scores that we compute. An O(nm) algorithm to accomplish this for strings of lengths n and m was given by Gotoh in 1982 in a paper called "An improved algorithm for matching biological sequences". Unfortunately, I can't find any links to free full text of the paper -- but there are many useful tutorials that you can find by googling "sequence alignment" and "affine gap penalty".
In general, different choices of match, mismatch, gap and gap region weights will give different alignments, but any negative score for gap regions will prefer the bottom alignment above to the top one.
What does all this have to do with your problem? If you use Gotoh's algorithm on individual characters with a suitable gap penalty (arrived at with a few empirical tests), you should find a significant decrease in the number of terrible-looking alignments like the example you gave.
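A score-only sketch of the affine-gap recurrences (my own illustrative Python after Gotoh's idea, not the paper's notation; the match/mismatch/gap weights are arbitrary and would need the empirical tuning mentioned above):

```python
NEG = float("-inf")

def gotoh_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score with affine gap penalties. Opening a gap costs
    gap_open; widening it by one more column costs only gap_extend, so one
    wide gap region scores better than several small ones."""
    n, m = len(a), len(b)
    # M: alignment ending in a match/mismatch column,
    # X: ending in a gap in b (deletion), Y: ending in a gap in a (insertion).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# "AAB" vs "AB": two matches plus one single-column gap.
print(gotoh_score("AAB", "AB"))  # → 1  (2 + 2 - 3)
```

Recovering the alignment itself would additionally require a traceback through the three tables, which is omitted here.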
Efficiency Considerations
Ideally, you could just do this on characters and ignore lines altogether, since the affine penalty will work to cluster changes into blocks spanning many lines wherever it can. But because of the higher running time, it may be more realistic to do a first pass on lines and then rerun the algorithm on characters, using as input all lines that are not identical. Under this scheme, any shared block of identical lines can be handled by compressing it into a single "character" with inflated matching weight, which helps to ensure no "crossings" appear.
After you have determined the mismatched chunks, use the same algorithm to determine which lines in these two chunks match each other. But you need to make a slight modification: when you used the algorithm to match equal lines, a pair of lines could either match or not match, so the comparison added either 0 or 1 to the cell of the table you used.
When comparing strings within one chunk, some of them are "more equal" than others (ack. to Orwell). So a pair of lines can add a real number between 0 and 1 to the cell when considering which sequence matches best so far.
To compute this metric (from 0 to 1), you can apply to each pair of strings you encounter... right, the same algorithm again (actually, you already did this when you were doing the first pass of the Levenshtein algorithm). This will compute the length of the LCS, whose ratio to the average length of the two strings would be the metric value.
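That metric can be sketched directly (an illustrative Python version of the ratio just described; names are my own):

```python
def lcs_length(a: str, b: str) -> int:
    # Dynamic-programming length of the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def line_similarity(a: str, b: str) -> float:
    """The 0..1 metric described above: LCS length divided by the average
    of the two line lengths (identical lines score exactly 1.0)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)

print(line_similarity("original 2", "original 2 changed"))  # ≈ 0.714
```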
Or, you can borrow the algorithm from one of the diff tools. For instance, vimdiff can highlight the matches you require.
Here's one possible solution someone else just made me realize:
My original approach was the per-line matching described above: find the best-matching pairs of lines within a mismatched block, then run the inter-line comparison on each pair.
While this would allow for a better visual display of changes when comparing source code revisions, I have now found that a much simpler approach is usually sufficient: treat the entire mismatched block on each side as one string and run the character-level diff over it as a whole.
Maybe those who replied to my original question assumed that I knew to do this all the time, but I had my focus so strongly on a per-line comparison that it did not occur to me to apply LCS on the set of lines by concatenating them, instead of processing them line-by-line.
So, while this approach will not provide change information as detailed as my original intent, it still improves the results over what I started with yesterday when I wrote this question.
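The concatenation idea above can be sketched like this (again with difflib standing in for the actual diff code; names are illustrative):

```python
import difflib

def block_diff(left_lines, right_lines):
    """Concatenate each side's mismatched block into one string (keeping
    newlines) and run a single character-level comparison over the whole
    block, returning only the changed spans."""
    left = "\n".join(left_lines)
    right = "\n".join(right_lines)
    sm = difflib.SequenceMatcher(None, left, right)
    return [(tag, left[i1:i2], right[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

changes = block_diff(["original 2"],
                     ["added line a", "added line b", "original 2 changed"])
print(changes)
```

Because the newlines are part of the strings, inserted lines simply show up as inserted spans that contain "\n", with no per-line pairing step needed.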
I'll leave this question open for a while longer - maybe someone else, reading all this, can still provide a complete answer (Pavel and random_hacker offered some suggestions, but it's not a complete solution yet - anyway, thank you for the helpful comments).