最好的基于(单词或字符)的差异算法是什么?
因此,我希望能够在每个单词的基础上找到两个字符串之间的差异(可能比每个字符更快,但是,如果每个字符更快,那么我想这样做)。
这是我想要实现的目标的示例: 源文本:
Hello there!
修改后的文本:
Helay scere?
diff:
Hel[lo](ay) [th](sc)ere[!](?)
- 括号内的文本是被删除的文本,括号内的文本是添加的文本,
有一种超级黑客的方法可以使用命令行工具来执行此操作,例如 opendiff,但它需要在每个字符之间有一个换行符,因为 opendiff 是基于行的。
我正在使用 ruby,并且还没有找到任何工具来执行此操作...但是语言并不是非常重要,因为算法可以很容易地移植。
谢谢。
So, I want to be able to find the diff between two strings on a per-word basis (maybe faster than per-character, though, if per-character is faster then I'd want to do it that way).
Here is an example of what I want to achieve:
Source Text:
Hello there!
Modified Text:
Helay scere?
diff:
Hel[lo](ay) [th](sc)ere[!](?)
- the bracketed text is what was removed, the parenthetical text is what was added
there is kind of a super hackish way to do this using a commandline tool, such as opendiff, but it requires a newline character inbetween every character, as opendiff is line-based.
I'm using ruby, and haven't found any tools to do this... but language isn't terribly important, as algorithms can be ported pretty easily.
thanks.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您可能需要检查:http://en.wikipedia.org/wiki/Longest_common_subsequence_problem。实施起来并不难。
You may want to check this: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem. It's not hard to implement.
请查看 https://github.com/pvande/differ。这颗宝石满足您的需求
Have a look to https://github.com/pvande/differ. This gem does what you are looking for
因此,您可以重复使用 LCS(如上面链接)来查找所有常见字符串,并将它们从两个字符串中删除,并用其他字符串替换它们 - 让我们只说一个“*”。然后,您同时迭代两个字符串,并将共同的和不同的重新网格化在一起。
示例
从这里开始,您只需进行简单的网格划分。需要注意的关键是可能存在空条目,例如,如果您在“Hell”和“Hel”上执行此方法,您最终会得到
希望这是可以接受的。
So what you can do repeatedly use the LCS (as linked above) to find all the common strings, and remove them from both your strings, replacing them with some other string - let's just say a "*". Then you iterate through both strings at the same time, and re-mesh the common and distinct back together.
Example
And from here you've just go to do simple meshing. Something key to note is that there may be null entries, for example if you are doing this method on "Hell" and "Hel" you'll eventually get
Hopefully that is acceptable.
这是一个对字符串进行比较的 ruby gem: http://rubydoc.info /gems/diff-lcs/1.1.3/frames
之前,我刚刚做了(在 irb 中)
因此,由于这个 2D diff 更改数组,编写插入、内联删除和插入标记的逻辑变得微不足道。
虽然我不确定这是否是最好的方法。
Here is a ruby gem that does diffing of strings: http://rubydoc.info/gems/diff-lcs/1.1.3/frames
Before hand, I just did (in irb)
So, writing the logic to insert, inline removed and inserted markers becomes trivial thanks to this 2D diff array of changes.
Though I'm not sure if this is this the best way.
解决方案是找到字符串之间的编辑距离。
A solution will be to find the edit distance between the strings.