将文本增量合并为单个“超字符串”的算法

发布于 2024-11-17 10:18:02 字数 918 浏览 12 评论 0原文

我正在编写软件来跟踪作者对一本书的多个版本所做的更改。我已经编写了代码来生成一组描述两个版本之间差异的增量。

现在,我正在寻找一种算法来内联组合所有这些差异,以创建一个“超级字符串”,其中包含每个版本中插入和删除的所有文本。然后我想用有关添加和删除文本位置的信息来标记 HTML 中的字符串。

这样,我只需将不同的 CSS 属性应用于文档,就可以直观地看到文本之间的差异。

示例

如果作者以这种方式更改句子

-0-    --1--    ---2---    ---3---
' ' -> 'cat' -> 'crate' -> 'crane'

我的代码会生成这些增量

0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>

,我想对其进行处理以创建如下文件:

<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>

问题

完成此任务的最佳算法是什么?这个问题有名字吗?

I'm writing software to track the changes an author made over several editions of a book. I already wrote code that produces a set of deltas describing differences between two editions.

Now I'm looking for an algorithm to combine all of these diffs inline to create a 'superstring' containing all of the text inserted and removed across every edition. Then I want to mark up the string in HTML with information about where the text was added and removed.

This way I can visualize the differences between the texts by simply applying different CSS attributes to the document.

Example

If the author changes a sentence in this way

-0-    --1--    ---2---    ---3---
' ' -> 'cat' -> 'crate' -> 'crane'

My code produces these deltas

0-1) <insert 'cat' at 0>
1-2) <insert 'r' at 1> <insert 'e' at 3>
2-3) <remove from 3 to 4> <insert 'n' at 3>

Which I want to process to create a file like this:

<span class="inserted-1">c</span>
<span class="inserted-2">r</span>
<span class="inserted-1">a</span>
<span class="inserted-1 removed-3">t</span>
<span class="inserted-3">n</span>
<span class="inserted-2">e</span>

Question

What's the best algorithm to accomplish this task? Is there a name for this problem?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

陈年往事 2024-11-24 10:18:02

您只需连接您的更改并跟踪它们的插入/删除时间。请注意,数字给出了字符串内的索引(并注意删除的字符不会增加索引)。

第 1 步: 0-1) <在 0 处插入“猫”>

  • [0] c 在第 1 步插入
  • [1 ] a 在步骤 1 插入
  • [2] t 在步骤 1 插入

步骤 2: 1-2) <在 1 处插入 'r'> ; <在第 3 处插入“e”>

  • [0] c 在第 1 步插入
  • [1] r 在第 2 步插入 <= 此处插入在此步骤中的位置 1
  • [2] a 在步骤 1 插入
  • [3] t 在步骤 1 插入
  • [4] e 在步骤 2 插入 <= 这是在这里插入的在此步骤中的位置 3

请注意,由于其他插入,“e”的位置实际上已移动到 4。

第 3 步: 2-3) <从 3 中删除> <在 3 处插入 'n'> <= 我将此更改为

  • 在步骤 1 插入的最小差异 [0] c
  • [1] r 在步骤 2 插入
  • [2] a 在第 1 步插入
  • [3] t 在第 1 步插入,在第 3 步删除 <= 不再计数,因此下一个索引是一样的
  • [3] n inserted at step 3 <= this was inserted in this step at position 3
  • 4] e inserted at step 2

所以基本算法是:

  • [ 字符列表,以及
  • 每个步骤的 插入步骤和删除步骤
    • 将这一步的差异分解为单个字符插入和删除
    • 要在位置 P 插入新字符 X,请执行以下操作:
      • 在列表中索引为 P 的最新字符之后插入新的字符 X,将插入步骤设置为当前步骤并调整后续项目的索引(即添加一个)。
    • 对于位置 P 处的删除 do
      • 通过将删除步骤设置为当前步骤并调整后面的索引(即减去一个),用索引 P 标记列表中的字符(其中只有一个仍然存在,即未设置删除步骤) )

在这两种情况下,请注意,此步骤中先前的插入/删除可能会改变当前操作的位置(轻松解决此问题的一种方法是从字符串末尾向后执行插入/删除)。

结果将是您在问题中指定的更改列表。对于很多更改,它可能会变得非常难以阅读,但它仍然会描述文本的完整历史。

You can just concatenate your changes and keep track of when they were inserted/removed. Note that the numbers give the index within the string (and note that chars removed do not increase index).

Step 1: 0-1) <insert 'cat' at 0>

  • [0] c inserted at step 1
  • [1] a inserted at step 1
  • [2] t inserted at step 1

Step 2: 1-2) <insert 'r' at 1> <insert 'e' at 3>

  • [0] c inserted at step 1
  • [1] r inserted at step 2 <= this was inserted here in this step at position 1
  • [2] a inserted at step 1
  • [3] t inserted at step 1
  • [4] e inserted at step 2 <= this was inserted here in this step at position 3

Note that the position of 'e' was actually shifted to 4 due to the other insertion.

Step 3: 2-3) <remove from 3> <insert 'n' at 3> <= I changed this one to the minimal diff

  • [0] c inserted at step 1
  • [1] r inserted at step 2
  • [2] a inserted at step 1
  • [3] t inserted at step 1, removed at step 3 <= does no longer count, thus the next index is the same
  • [3] n inserted at step 3 <= this was inserted here in this step at position 3
  • [4] e inserted at step 2

So the basic algorithm is:

  • maintain a list of characters, together with step of insertion and step of removal
  • for each step do
    • decompose your diff from this step into single char inserts and removals
    • for a insert of a new char X at position P do the following:
      • insert the new char X after the latest char in your list with index P, set step of insertion to current step and adjust indices of following items (i.e. add one).
    • for a deletion at a position P do
      • mark the char in your list with index P (there is only one of them which still is present, i.e. where the removal step is not set) by setting the removal step to the current step and adjust later indices (i.e. subtract one)

In both cases note that previous insertions/removals within this step may shift the position of current action (one way to easily work around this is to do insertions/removals backwards from end of string to start).

The result will be a list of changes as you specified in your question. For lots of changes it might become quite unreadable but it will nevertheless describe the complete history of your text.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文