In-place radix sort

Posted on 2024-07-12 22:32:51

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?

Preliminary

I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.

Details

I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).

The reason is obvious: radix sort requires copying the whole data set (more than once in my naive implementation, actually). This means that I've put ~4 GiB into main memory, which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
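To make the copying cost concrete, here is a minimal sketch of an out-of-place LSD radix sort for fixed-length DNA strings. It is an assumption about what a naive implementation might look like, not the code from the post; `base_rank` and `lsd_radix_sort` are hypothetical names. The point to notice is the `scratch` vector: a second array with one slot per input element is needed on top of the data itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: map a base to its bucket, A < C < G < T.
inline std::size_t base_rank(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "unexpected character");
    return 0;
}

// Out-of-place LSD radix sort; every string must have length `len`.
void lsd_radix_sort(std::vector<std::string>& data, std::size_t len) {
    // Second full-size array: this is the extra memory the post complains about.
    std::vector<std::string> scratch(data.size());
    for (std::size_t pos = len; pos-- > 0;) {
        // start[b + 1] first counts bucket b, then becomes a prefix sum.
        std::array<std::size_t, 5> start{};
        for (const std::string& s : data) ++start[base_rank(s[pos]) + 1];
        for (std::size_t b = 1; b < start.size(); ++b) start[b] += start[b - 1];
        // Stable scatter: every element is moved once per character position.
        for (std::string& s : data) {
            scratch[start[base_rank(s[pos])]++] = std::move(s);
        }
        data.swap(scratch);
    }
}
```

Calling `lsd_radix_sort(reads, 4)` on a vector of 4-character strings sorts it lexicographically, at the price of `len` full passes over the data and the second array.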

Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.

Research

Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.

In particular, I have the following problems with it.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursive call (I simply assume that it increments or decrements some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn’t trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32-bit integers). At 2 bits per base, that's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
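The 16-character limit follows from the encoding: four letters need 2 bits each, and 32 / 2 = 16. A minimal sketch of such a packing is shown here, assuming the mapping A=0, C=1, G=2, T=3; `pack_dna` and `unpack_dna` are hypothetical names, not anything from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Pack up to 16 bases into a 32-bit word, 2 bits per base (assumed mapping
// A=0, C=1, G=2, T=3). Earlier bases end up in higher bits, so for strings of
// equal length, integer order matches lexicographic order.
inline std::uint32_t pack_dna(const std::string& s) {
    assert(s.size() <= 16);
    std::uint32_t word = 0;
    for (char c : s) {
        std::uint32_t code = 0;
        switch (c) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            case 'T': code = 3; break;
            default: assert(false && "unexpected character");
        }
        word = (word << 2) | code;
    }
    return word;
}

// Inverse of pack_dna; the length must be supplied because it is not stored.
inline std::string unpack_dna(std::uint32_t word, std::size_t len) {
    assert(len <= 16);
    std::string s(len, 'A');
    for (std::size_t i = len; i-- > 0;) {
        s[i] = "ACGT"[word & 3u];
        word >>= 2;
    }
    return s;
}
```

For example, `pack_dna("ACGT")` is binary 00 01 10 11, that is 27. Anything longer than 16 bases no longer fits, which is the limitation described above.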

Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear, which is flat-out wrong.

To recap: Is there any hope of finding a working reference implementation, or at least good pseudocode or a good description, of an in-place radix sort that works on DNA strings?
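For orientation, one well-known in-place MSD radix sort is usually called American flag sort: count the buckets, then swap elements into their bucket ranges instead of copying them to a second array. The following is only a sketch in that style for fixed-length strings over A, C, G, T. It is not the MSL algorithm from the paper, and the names `rank_of` and `flag_sort` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: map a base to its bucket, A < C < G < T.
inline std::size_t rank_of(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "unexpected character");
    return 0;
}

// Sort data[lo, hi) on character positions pos..len-1 without a second array.
// Extra space: two small fixed arrays per recursion level, depth at most len.
void flag_sort(std::vector<std::string>& data, std::size_t lo, std::size_t hi,
               std::size_t pos, std::size_t len) {
    if (hi - lo < 2 || pos == len) return;

    // Pass 1: count how many strings fall into each bucket.
    std::array<std::size_t, 4> count{};
    for (std::size_t i = lo; i < hi; ++i) ++count[rank_of(data[i][pos])];

    // next[b] is the first unfilled slot of bucket b, end[b] is one past its last.
    std::array<std::size_t, 4> next{}, end{};
    std::size_t at = lo;
    for (std::size_t b = 0; b < 4; ++b) {
        next[b] = at;
        at += count[b];
        end[b] = at;
    }

    // Pass 2: swap each misplaced string into the next free slot of its bucket.
    for (std::size_t b = 0; b < 4; ++b) {
        while (next[b] < end[b]) {
            const std::size_t r = rank_of(data[next[b]][pos]);
            if (r == b) {
                ++next[b];
            } else {
                std::swap(data[next[b]], data[next[r]]);
                ++next[r];
            }
        }
    }

    // Recurse into each bucket on the next character.
    std::size_t begin = lo;
    for (std::size_t b = 0; b < 4; ++b) {
        flag_sort(data, begin, end[b], pos + 1, len);
        begin = end[b];
    }
}
```

A call such as `flag_sort(reads, 0, reads.size(), 0, 7)` sorts 7-character strings. Unlike the LSD variant, the result is not stable, which does not matter when the whole string is the key.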

亢潮 2024-07-19 22:32:52

If your data set is so big, then I would think that a disk-based buffer approach would be best:

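sort(List<string> elements, int prefix)
    // Small inputs are sorted in memory; larger ones spill to disk.
    if (elements.Count < THRESHOLD)
        return InMemoryRadixSort(elements, prefix)
    else
        return DiskBackedRadixSort(elements, prefix)

DiskBackedRadixSort(List<string> elements, int prefix)
    // One disk-backed buffer per possible symbol at position `prefix`.
    DiskBackedBuffer<string>[] buckets
    foreach (element in elements)
        buckets[element.MSB(prefix)].Add(element)

    // Sort each bucket on the next position and concatenate in bucket order.
    List<string> ret
    foreach (bucket in buckets)
        ret.AddRange(sort(bucket, prefix + 1))

    return ret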

I would also experiment with grouping into a larger number of buckets. For instance, if your string was:

GATTACA

the first MSB call would return the bucket for GATT (256 total buckets). That way you make fewer branches of the disk-based buffer. This may or may not improve performance, so experiment with it.
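The 256 comes from 4 letters over 4 positions: 4^4 = 256. As a rough sketch of what such a grouped `MSB` call could compute, the function here treats a group of characters as a base-4 number. `bucket_of` and `base_index` are hypothetical names, and the mapping A=0, C=1, G=2, T=3 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: map a base to a digit, A < C < G < T.
inline std::size_t base_index(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "unexpected character");
    return 0;
}

// Bucket number for the `width` characters of `s` starting at `offset`,
// read as a base-4 number. With width = 4 there are 4^4 = 256 buckets.
inline std::size_t bucket_of(const std::string& s, std::size_t offset,
                             std::size_t width = 4) {
    assert(offset + width <= s.size());
    std::size_t bucket = 0;
    for (std::size_t i = 0; i < width; ++i) {
        bucket = bucket * 4 + base_index(s[offset + i]);
    }
    return bucket;
}
```

For GATTACA, `bucket_of(s, 0)` gives 2*64 + 0*16 + 3*4 + 3 = 143. With grouping, the recursion would advance by the group width instead of by 1, and the last group may be shorter: `bucket_of(s, 4, 3)` covers the remaining ACA.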
