最好的基于（单词或字符）的差异算法是什么？

发布于 2024-12-20 01:20:59 字数 524 浏览 1 评论 0原文

因此，我希望能够在每个单词的基础上找到两个字符串之间的差异（可能比每个字符更快，但是，如果每个字符更快，那么我想这样做）。

这是我想要实现的目标的示例：源文本：

Hello there!

修改后的文本：

Helay scere?

diff：

Hel[lo](ay) [th](sc)ere[!](?)

括号内的文本是被删除的文本，括号内的文本是添加的文本，

有一种超级黑客的方法可以使用命令行工具来执行此操作，例如 opendiff，但它需要在每个字符之间有一个换行符，因为 opendiff 是基于行的。

我正在使用 ruby，并且还没有找到任何工具来执行此操作...但是语言并不是非常重要，因为算法可以很容易地移植。

谢谢。

原文

So, I want to be able to find the diff between two strings on a per-word basis (maybe faster than per-character, though, if per-character is faster then I'd want to do it that way).

Here is an example of what I want to achieve:
Source Text:

Hello there!

Modified Text:

Helay scere?

diff:

Hel[lo](ay) [th](sc)ere[!](?)

the bracketed text is what was removed, the parenthetical text is what was added

there is kind of a super hackish way to do this using a commandline tool, such as opendiff, but it requires a newline character inbetween every character, as opendiff is line-based.

I'm using ruby, and haven't found any tools to do this... but language isn't terribly important, as algorithms can be ported pretty easily.

thanks.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

东京女 2024-12-27 01:20:59

您可能需要检查：http://en.wikipedia.org/wiki/Longest_common_subsequence_problem。实施起来并不难。

回复收藏 0 原文

丢了幸福的猪 2024-12-27 01:20:59

请查看 https://github.com/pvande/differ。这颗宝石满足您的需求

回复收藏 0 原文

云朵有点甜 2024-12-27 01:20:59

因此，您可以重复使用 LCS（如上面链接）来查找所有常见字符串，并将它们从两个字符串中删除，并用其他字符串替换它们 - 让我们只说一个“*”。然后，您同时迭代两个字符串，并将共同的和不同的重新网格化在一起。

示例

A) Hello there!
B) Helay scere?

LCS detection gives us ["Hel"," ","ere"], and after replacement we have
A) *lo*th*!
B) *ay*sc*?

Now you split on the delimiter ("*") giving you
A) ["lo","th","!"]
B) ["ay","sc","?"]

从这里开始，您只需进行简单的网格划分。需要注意的关键是可能存在空条目，例如，如果您在“Hell”和“Hel”上执行此方法，您最终会得到

Common LCS) ["Hel"]
A) ["l"]
B) [""]

meaning your result will be Hel[l]()

希望这是可以接受的。

So what you can do repeatedly use the LCS (as linked above) to find all the common strings, and remove them from both your strings, replacing them with some other string - let's just say a "*". Then you iterate through both strings at the same time, and re-mesh the common and distinct back together.

Example

A) Hello there!
B) Helay scere?

LCS detection gives us ["Hel"," ","ere"], and after replacement we have
A) *lo*th*!
B) *ay*sc*?

Now you split on the delimiter ("*") giving you
A) ["lo","th","!"]
B) ["ay","sc","?"]

And from here you've just go to do simple meshing. Something key to note is that there may be null entries, for example if you are doing this method on "Hell" and "Hel" you'll eventually get

Common LCS) ["Hel"]
A) ["l"]
B) [""]

meaning your result will be Hel[l]()

Hopefully that is acceptable.

回复收藏 0 原文

冬天旳寂寞 2024-12-27 01:20:59

这是一个对字符串进行比较的 ruby gem： http://rubydoc.info /gems/diff-lcs/1.1.3/frames

之前，我刚刚做了（在 irb 中）

require 'rubygems'
require 'diff/lcs'
require 'diff/lcs/array'
require 'diff/lcs/string'

在此处输入图像描述

因此，由于这个 2D diff 更改数组，编写插入、内联删除和插入标记的逻辑变得微不足道。

虽然我不确定这是否是最好的方法。

Here is a ruby gem that does diffing of strings: http://rubydoc.info/gems/diff-lcs/1.1.3/frames

Before hand, I just did (in irb)

require 'rubygems'
require 'diff/lcs'
require 'diff/lcs/array'
require 'diff/lcs/string'

enter image description here

So, writing the logic to insert, inline removed and inserted markers becomes trivial thanks to this 2D diff array of changes.

Though I'm not sure if this is this the best way.

回复收藏 0 原文

许仙没带伞 2024-12-27 01:20:59

解决方案是找到字符串之间的编辑距离。

回复收藏 0 原文

~没有更多了~

关于作者

一杆小烟枪

暂无简介

文章

26 人气

关注发私信

友情链接

文江博客

最好的基于（单词或字符）的差异算法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

关于作者

相关话题

热门标签

推荐作者

卷耳

佚名

℉服软

qq_2gSKZM

凉宸

gyhjy

友情链接

最好的基于（单词或字符）的差异算法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

关于作者

相关话题

热门标签

推荐作者

卷耳

佚名

℉服软

qq_2gSKZM

凉宸

gyhjy

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。