如何在块排序中对数组后缀进行排序

发布于 2024-11-15 11:25:35 字数 883 浏览 2 评论 0原文

我正在阅读 Burrows 和 Wheeler 论文中的块排序算法。这是算法的一个步骤：

假设 S= abracadabra

初始化一个由 N 个单词 W[0, ... , N - 1] 组成的数组 W，使得 W[i] 包含字符 S'[i, .. . , i + k - 1] 的排列使得单词的整数比较与 k 字符字符串的字典比较一致。将字符打包成单词有两个好处：它允许使用对齐的内存访问一次比较两个前缀 k 个字节，并且它允许消除许多缓慢的情况

（注意：S'是原始的 S ，附加了 k 个 EOF 字符，k 是适合机器字的字符数（我在 32 位机器中，所以 < code>k=4)

EOF = '$'

请纠正我，如果我错了：

S'= abracadabra$$$$  
W= abra brac raca acad cada adab dabr abra bra$ ra$$ a$$$

然后，算法说你必须通过索引对 S（名为 V）的后缀数组进行排序数组W。

我不完全理解如何通过索引到 W 来对后缀进行排序。例如：在排序的某个时刻，假设您得到两个后缀，i 和 j，并且您必须对它们进行比较。由于您正在索引 W，因此您一次检查 4 个字符。
假设它们的前 4 个字符相同。然后，您必须检查每个后缀接下来的 4 个字符，并且可以通过从 W 中每个后缀的第 4 个位置进行访问来完成此操作。这是对的吗？这种“将字符打包成单词”真的能加快速度吗？

原文

I'm reading the block sort algorithm from the Burrows and Wheeler paper.
This a step of the algorithm:

Suppose S= abracadabra

Initialize an array W of N words W[0, ... , N - 1], such that W[i] contains the characters S'[i, ... , i + k - 1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k-character strings. Packing characters into words has two benefits: It allows two prefixes to be compared k bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated

(Note: S' is the original S with k EOF characters appended to it, k being the number of characters that fit in a machine word (I'm in a 32 bits machine, so k=4)

EOF = '
Correct me if I'm wrong:
S'= abracadabra$$  
W= abra brac raca acad cada adab dabr abra bra$ ra$ a$$

Then, the algorithm says you have to sort the suffix array of S (named V), by indexing into

the array W.
I don't fully understand how can you sort suffixes by indexing into W.

For example: at some point of the sorting, suppose you get two suffixes, i and j, and you have to compare them. Since you are indexing into W, you are checking 4 characters at the time.

 Suppose they have both the same first 4 characters. Then, you would have to check, for each suffix their next 4 characters, and you do it by accessing from the 4th position of each suffix in W.

Is this right? Does this "packing characters into words" really speed things up?

Correct me if I'm wrong:

Then, the algorithm says you have to sort the suffix array of S (named V), by indexing into
the array W.

I don't fully understand how can you sort suffixes by indexing into W.
For example: at some point of the sorting, suppose you get two suffixes, i and j, and you have to compare them. Since you are indexing into W, you are checking 4 characters at the time.
Suppose they have both the same first 4 characters. Then, you would have to check, for each suffix their next 4 characters, and you do it by accessing from the 4th position of each suffix in W.
Is this right? Does this "packing characters into words" really speed things up?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜夜流光相皎洁 2024-11-22 11:25:35

您在问题中描述它的方式是完全准确的。是的，它加快了速度，因为就像你说的，它一次比较四个字符。

不过，有两点需要注意：

当您比较后缀 i 和 j 时，就像在您的示例中一样，您确实比较了条目 W[i] 和 W[j]。其结果与按字典顺序比较四元组字符 S[i..i+3] 和 S[j..j+3] 的结果相同，因此您节省的计算时间相当于三个字符比较的时间。是的，如果结果表明两个四元组相同，您必须继续比较 W[i+1] 和 W[j+1]，但是：您不会立即这样做。他们的算法的工作方式是基数排序。也就是说，您在初始比较后立即将后缀放入存储桶中（可能都放入同一个存储桶中），然后以递归方式在内部对存储桶进行排序。
Burrows 和 Wheeler 的原始论文中描述的算法（您从中引用；有一个副本这里为例），这是1994年的，不是最优的后缀数组构造算法。首先，2003年发现了几种O(N)直接构造方法；其次，自那时以来，实施工作取得了许多进一步的改进。 1994 年论文的核心思想是使用 Burrows-Wheeler 变换作为字符串压缩的基础，而不是变换本身生成的确切方式。

回复收藏 0 原文

趁年轻赶紧闹 2024-11-22 11:25:35

数组 V 不是后缀数组，而是 W 的索引数组。排序完成后，V 应该保存 W 的索引，这样的话

V[i] <= V[j]

我

 W[V[i]] <= W[V[j]].

希望我说的是对的:)让它们完全匹配不是问题，并且任何一个顺序都可以。关键是，当您应用反向转换时，您需要能够恢复 W 才能恢复原始字符串，并且 W 的相同元素不会导致问题。

The array V is not a suffix array, but an array of indices into W. Once the sorting is complete, V should hold indices into W such that if

V[i] <= V[j]

then

 W[V[i]] <= W[V[j]].

I hope I said that right :) Having them EXACTLY match is not an issue and either order is fine. The point is that when you apply the reverse transformation you need to be able to recover W in order to recover the original string, and identical elements of W are not going to cause an problem with that.

回复收藏 0 原文

~没有更多了~