将文本增量合并为单个“超字符串”的算法
我正在编写软件来跟踪作者对一本书的多个版本所做的更改。我已经编写了代码来生成一组描述两个版本之间差异的增量。
现在,我正在寻找一种算法来内联组合所有这些差异,以创建一个“超级字符串”,其中包含每个版本中插入和删除的所有文本。然后我想用有关添加和删除文本位置的信息来标记 HTML 中的字符串。
这样,我只需将不同的 CSS 属性应用于文档,就可以直观地看到文本之间的差异。
示例
如果作者以这种方式更改句子
-0- --1-- ---2--- ---3---
' ' -> 'cat' -> 'crate' -> 'crane'
我的代码会生成这些增量
0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>
,我想对其进行处理以创建如下文件:
<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>
问题
完成此任务的最佳算法是什么?这个问题有名字吗?
I'm writing software to track the changes an author made over several editions of a book. I already wrote code that produces a set of deltas describing differences between two editions.
Now I'm looking for an algorithm to combine all of these diffs inline to create a 'superstring' containing all of the text inserted and removed across every edition. Then I want to mark up the string in HTML with information about where the text was added and removed.
This way I can visualize the differences between the texts by simply applying different CSS attributes to the document.
Example
If the author changes a sentence in this way
-0- --1-- ---2--- ---3---
' ' -> 'cat' -> 'crate' -> 'crane'
My code produces these deltas
0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>
Which I want to process to create a file like this:
<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>
Question
What's the best algorithm to accomplish this task? Is there a name for this problem?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您只需连接您的更改并跟踪它们的插入/删除时间。请注意,数字给出了字符串内的索引(并注意删除的字符不会增加索引)。
第 1 步:
0-1) <在 0 处插入“猫”>
[0] c 在第 1 步插入
[1 ] a 在步骤 1 插入
[2] t 在步骤 1 插入
步骤 2:
1-2) <在 1 处插入 'r'> ; <在第 3 处插入“e”>
[0] c 在第 1 步插入
[1] r 在第 2 步插入
<= 此处插入在此步骤中的位置 1[2] a 在步骤 1 插入
[3] t 在步骤 1 插入
[4] e 在步骤 2 插入
<= 这是在这里插入的在此步骤中的位置 3请注意,由于其他插入,“e”的位置实际上已移动到 4。
第 3 步:
2-3) <从 3 中删除> <在 3 处插入 'n'>
<= 我将此更改为[0] c
[1] r 在步骤 2 插入
[2] a 在第 1 步插入
[3] t 在第 1 步插入,在第 3 步删除
<= 不再计数,因此下一个索引是一样的[3] n inserted at step 3
<= this was inserted in this step at position 3所以基本算法是:
[ 字符列表,以及
在这两种情况下,请注意,此步骤中先前的插入/删除可能会改变当前操作的位置(轻松解决此问题的一种方法是从字符串末尾向后执行插入/删除)。
结果将是您在问题中指定的更改列表。对于很多更改,它可能会变得非常难以阅读,但它仍然会描述文本的完整历史。
You can just concatenate your changes and keep track of when they were inserted/removed. Note that the numbers give the index within the string (and note that chars removed do not increase index).
Step 1:
0-1) <insert 'cat' at 0>
[0] c inserted at step 1
[1] a inserted at step 1
[2] t inserted at step 1
Step 2:
1-2) <insert 'r' at 1> <insert 'e' at 3>
[0] c inserted at step 1
[1] r inserted at step 2
<= this was inserted here in this step at position 1[2] a inserted at step 1
[3] t inserted at step 1
[4] e inserted at step 2
<= this was inserted here in this step at position 3Note that the position of 'e' was actually shifted to 4 due to the other insertion.
Step 3:
2-3) <remove from 3> <insert 'n' at 3>
<= I changed this one to the minimal diff[0] c inserted at step 1
[1] r inserted at step 2
[2] a inserted at step 1
[3] t inserted at step 1, removed at step 3
<= does no longer count, thus the next index is the same[3] n inserted at step 3
<= this was inserted here in this step at position 3[4] e inserted at step 2
So the basic algorithm is:
In both cases note that previous insertions/removals within this step may shift the position of current action (one way to easily work around this is to do insertions/removals backwards from end of string to start).
The result will be a list of changes as you specified in your question. For lots of changes it might become quite unreadable but it will nevertheless describe the complete history of your text.