How to sort array suffixes in block sorting
I'm reading the block-sorting algorithm from the Burrows and Wheeler paper. This is one step of the algorithm:
Suppose S = abracadabra
Initialize an array W of N words W[0, ..., N - 1], such that W[i] contains the characters S'[i, ..., i + k - 1] arranged so that integer comparisons on the words agree with lexicographic comparisons on the k-character strings. Packing characters into words has two benefits: it allows two prefixes to be compared k bytes at a time using aligned memory accesses, and it allows many slow cases to be eliminated.
(Note: S' is the original S with k EOF characters appended to it, k being the number of characters that fit in a machine word. I'm on a 32-bit machine, so k = 4.)
EOF = '$'
Correct me if I'm wrong:
S' = abracadabra$$$$
W = abra brac raca acad cada adab dabr abra bra$ ra$$ a$$$
Then the algorithm says you have to sort the suffix array of S (named V) by indexing into the array W.
I don't fully understand how you can sort suffixes by indexing into W. For example: at some point in the sort, suppose you get two suffixes, i and j, and you have to compare them. Since you are indexing into W, you are checking 4 characters at a time.
Suppose both suffixes have the same first 4 characters. Then you would have to check the next 4 characters of each suffix, and you do that by reading W at the 4th position after the start of each suffix, that is, W[i + 4] and W[j + 4].
Is this right? Does this "packing characters into words" really speed things up?
The way you describe it in the question is entirely accurate. And yes, it speeds things up because, like you said, it compares four characters at a time.
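As a rough illustration of the packing step, here is a minimal Python sketch. It assumes byte-sized characters, k = 4, and '$' as the EOF character, as in the question; the helper name `pack_words` is made up for this example and does not come from the paper. The point it shows is that placing the first character in the most significant byte (big-endian order) makes plain integer comparison agree with lexicographic comparison of the 4-character strings.

```python
K = 4        # characters per 32-bit word (k = 4 in the question)
EOF = b"$"   # EOF marker from the question; assumed not to occur in S


def pack_words(s, k=K):
    """Build W, where W[i] packs S'[i .. i+k-1] into one integer.

    The first character goes into the most significant byte, so comparing
    two words as integers orders them like the k-character strings.
    """
    padded = s + EOF * k  # S' = S followed by k EOF characters
    return [int.from_bytes(padded[i:i + k], "big") for i in range(len(s))]


S = b"abracadabra"
W = pack_words(S)

# Decode the words again to check them against the list in the question.
for i, w in enumerate(W):
    print(i, w.to_bytes(K, "big").decode())

# One integer comparison stands in for up to four character comparisons.
assert (W[0] < W[1]) == (b"abra" < b"brac")
```

On a little-endian machine, a real implementation would have to assemble each word explicitly in this order rather than reinterpret the raw bytes, otherwise the integer order would not match the string order.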
There are two remarks to be made, though:
The array V is not a suffix array, but an array of indices into W. Once the sorting is complete, V should hold indices into W such that if ... then ...
I hope I said that right :) Having them EXACTLY match is not an issue, and either order is fine. The point is that when you apply the reverse transformation you need to be able to recover W in order to recover the original string, and identical elements of W are not going to cause a problem with that.
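To make the comparison described in the question concrete, here is a minimal Python sketch of sorting the index array V by reading W, k characters at a time. It is an illustration under stated assumptions, not the paper's implementation: the names `pack_words`, `sort_suffixes`, `word`, and `compare` are hypothetical, a general-purpose comparison sort replaces whatever sorting scheme the paper uses, and any index at or past N is treated as a word of k EOF characters, which the quoted passage does not spell out.

```python
from functools import cmp_to_key

K = 4        # characters per word, as in the question
EOF = b"$"   # EOF marker from the question; assumed not to occur in S


def pack_words(s, k=K):
    """W[i] packs S'[i .. i+k-1], first character in the most significant byte."""
    padded = s + EOF * k
    return [int.from_bytes(padded[i:i + k], "big") for i in range(len(s))]


def sort_suffixes(s, k=K):
    """Return V: indices into W, ordered by the suffixes that start there."""
    n = len(s)
    W = pack_words(s, k)
    eof_word = int.from_bytes(EOF * k, "big")

    def word(i):
        # Assumption of this sketch: reading past the end yields k EOFs.
        return W[i] if i < n else eof_word

    def compare(i, j):
        # Compare k characters per step; on a tie, move both suffixes
        # forward by k, i.e. look at W[i + k] and W[j + k].
        while i < n or j < n:
            a, b = word(i), word(j)
            if a != b:
                return -1 if a < b else 1
            i += k
            j += k
        return 0  # only reached when both indices started out equal

    return sorted(range(n), key=cmp_to_key(compare))


V = sort_suffixes(b"abracadabra")
print(V)  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
```

In this example, suffixes 0 and 7 both start with "abra", so the comparison falls through to the next pair of words: W[4] = "cada" against the all-EOF word, which decides the order with one more integer comparison. The result relies on '$' sorting below the lowercase letters in ASCII.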