如何在块排序中对数组后缀进行排序

发布于 2024-11-15 11:25:35 字数 883 浏览 2 评论 0原文

我正在阅读 Burrows 和 Wheeler 论文中的块排序算法。 这是算法的一个步骤:

假设 S= abracadabra

初始化一个由 N 个单词 W[0, ... , N - 1] 组成的数组 W,使得 W[i] 包含字符 S'[i, .. . , i + k - 1] 的排列使得单词的整数比较与 k 字符字符串的字典比较一致。将字符打包成单词有两个好处:它允许使用对齐的内存访问一次比较两个前缀 k 个字节,并且它允许消除许多缓慢的情况

(注意:S'是原始的 S ,附加了 k 个 EOF 字符,k 是适合机器字的字符数(我在 32 位机器中,所以 < code>k=4)

EOF = '$'

请纠正我,如果我错了:

S'= abracadabra$$$$  
W= abra brac raca acad cada adab dabr abra bra$ ra$$ a$$$

然后,算法说你必须通过索引对 S(名为 V)的后缀数组进行排序 数组W

我不完全理解如何通过索引到 W 来对后缀进行排序。 例如:在排序的某个时刻,假设您得到两个后缀,ij,并且您必须对它们进行比较。由于您正在索引 W,因此您一次检查 4 个字符。
假设它们的前 4 个字符相同。然后,您必须检查每个后缀接下来的 4 个字符,并且可以通过从 W 中每个后缀的第 4 个位置进行访问来完成此操作。 这是对的吗?这种“将字符打包成单词”真的能加快速度吗?

I'm reading the block sort algorithm from the Burrows and Wheeler paper.
This a step of the algorithm:

Suppose S= abracadabra

Initialize an array W of N words W[0, ... , N - 1], such that W[i] contains the characters S'[i, ... , i + k - 1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k-character strings. Packing characters into words has two benefits: It allows two prefixes to be compared k bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated

(Note: S' is the original S with k EOF characters appended to it, k being the number of characters that fit in a machine word (I'm in a 32 bits machine, so k=4)

EOF = '

Correct me if I'm wrong:

S'= abracadabra$$  
W= abra brac raca acad cada adab dabr abra bra$ ra$ a$$

Then, the algorithm says you have to sort the suffix array of S (named V), by indexing into
the array W
.

I don't fully understand how can you sort suffixes by indexing into W.
For example: at some point of the sorting, suppose you get two suffixes, i and j, and you have to compare them. Since you are indexing into W, you are checking 4 characters at the time.
Suppose they have both the same first 4 characters. Then, you would have to check, for each suffix their next 4 characters, and you do it by accessing from the 4th position of each suffix in W.
Is this right? Does this "packing characters into words" really speed things up?

Correct me if I'm wrong:

Then, the algorithm says you have to sort the suffix array of S (named V), by indexing into
the array W
.

I don't fully understand how can you sort suffixes by indexing into W.
For example: at some point of the sorting, suppose you get two suffixes, i and j, and you have to compare them. Since you are indexing into W, you are checking 4 characters at the time.
Suppose they have both the same first 4 characters. Then, you would have to check, for each suffix their next 4 characters, and you do it by accessing from the 4th position of each suffix in W.
Is this right? Does this "packing characters into words" really speed things up?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

夜夜流光相皎洁 2024-11-22 11:25:35

您在问题中描述它的方式是完全准确的。是的,它加快了速度,因为就像你说的,它一次比较四个字符。

不过,有两点需要注意:

  1. 当您比较后缀 i 和 j 时,就像在您的示例中一样,您确实比较了条目 W[i] 和 W[j]。其结果与按字典顺序比较四元组字符 S[i..i+3] 和 S[j..j+3] 的结果相同,因此您节省的计算时间相当于三个字符比较的时间。是的,如果结果表明两个四元组相同,您必须继续比较 W[i+1] 和 W[j+1],但是:您不会立即这样做。他们的算法的工作方式是基数排序。也就是说,您在初始比较后立即将后缀放入存储桶中(可能都放入同一个存储桶中),然后以递归方式在内部对存储桶进行排序。
  2. Burrows 和 Wheeler 的原始论文中描述的算法(您从中引用;有一个副本 这里为例),这是1994年的,不是最优的后缀数组构造算法。首先,2003年发现了几种O(N)直接构造方法;其次,自那时以来,实施工作取得了许多进一步的改进。 1994 年论文的核心思想是使用 Burrows-Wheeler 变换作为字符串压缩的基础,而不是变换本身生成的确切方式。

The way you describe it in the question is entirely accurate. And yes, it speeds things up because, like you said, it compares four characters at a time.

There are two remarks to be made, though:

  1. When you compare suffixes i and j, like in your example, you compare entries W[i] and W[j] indeed. The result of this is the same as if you had lexicographically compared the quadruple of characters S[i..i+3] and S[j..j+3], so you have saved computing time equivalent to three character comparisons. And yes, if the result indicates that the two quadruples are identical, you have to continue comparing W[i+1] and W[j+1], however: You don't do it right away. The way their algorithm works is that of a radix sort. That is, you put the suffixes into buckets right after the initial comparison (possibly both into the same bucket), and then internally sort the buckets, recursively.
  2. The algorithm described in the original paper by Burrows and Wheeler (from which you cite; there is a copy here for example), which is from 1994, is not the optimal suffix array construction algorithm. Firstly, in 2003 several O(N), direct construction methods were discovered; secondly, since then, many further improvements to the implementation were made. The core of the 1994 paper is the idea of using the Burrows-Wheeler transform as basis for string compression, not the exact way the transform itself is generated.
趁年轻赶紧闹 2024-11-22 11:25:35

数组 V 不是后缀数组,而是 W 的索引数组。排序完成后,V 应该保存 W 的索引,这样的话

V[i] <= V[j]

 W[V[i]] <= W[V[j]].

希望我说的是对的:)让它们完全匹配不是问题,并且任何一个顺序都可以。关键是,当您应用反向转换时,您需要能够恢复 W 才能恢复原始字符串,并且 W 的相同元素不会导致问题。

The array V is not a suffix array, but an array of indices into W. Once the sorting is complete, V should hold indices into W such that if

V[i] <= V[j]

then

 W[V[i]] <= W[V[j]].

I hope I said that right :) Having them EXACTLY match is not an issue and either order is fine. The point is that when you apply the reverse transformation you need to be able to recover W in order to recover the original string, and identical elements of W are not going to cause an problem with that.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文