就地基数排序
这是一篇很长的文字。 请多多包涵。 归根结底,问题是:是否存在可行的就地基数排序算法?
初步
我有大量小型固定长度字符串,它们仅使用字母“A”、“C”、“G”和“T”(是的,你已经猜到了: DNA) 我想要排序。
目前,我使用 std::sort
,它使用 introsort STL 的所有常见实现。 这非常有效。 但是,我确信基数排序完全适合我的问题集并且应该有效 在实践中更好。
详细信息
我已经用一个非常简单的实现测试了这个假设,对于相对较小的输入(大约 10,000),这是正确的(嗯,至少快两倍以上)。 然而,当问题规模变大(N > 5,000,000)时,运行时间会大大降低。
原因很明显:基数排序需要复制整个数据(实际上,在我的幼稚实现中不止一次)。 这意味着我已将 ~ 4 GiB 放入主内存中,这显然会降低性能。 即使没有,我也无法使用这么多内存,因为问题规模实际上变得更大。
理想情况
下,该算法应适用于 2 到 100 之间的任何字符串长度,对于 DNA 以及 DNA5(允许附加通配符“N”),甚至是具有 IUPAC 歧义代码 (产生 16 个不同的值)。 但是,我意识到无法涵盖所有这些情况,因此我对速度的提高感到满意。 代码可以动态决定分派到哪个算法。
研究
不幸的是,关于基数排序的维基百科文章毫无用处。 关于就地变体的部分完全是垃圾。 NIST-DADS 关于基数排序的部分几乎不存在。 有一篇听起来很有前途的论文,名为高效自适应就地基数排序,其中描述了算法“MSL”。 不幸的是,这篇论文也令人失望。
具体来说,有以下几件事。
首先,该算法包含多个错误,并且有很多无法解释的地方。 特别是,它没有详细说明递归调用(我只是假设它增加或减少一些指针来计算当前的移位和掩码值)。 此外,它使用函数 dest_group
和 dest_address
而不给出定义。 我不知道如何有效地实现这些(即,在 O(1) 中;至少 dest_address
并不是微不足道的)。
最后但并非最不重要的一点是,该算法通过将数组索引与输入数组内的元素交换来实现就地性。 这显然只适用于数值数组。 我需要在字符串上使用它。 当然,我可以只使用强类型并继续假设内存能够容忍我将索引存储在不属于它的地方。 但这仅在我可以将字符串压缩到 32 位内存中(假设 32 位整数)时才有效。 这只有 16 个字符(我们暂时忽略 16 > log(5,000,000))。
其中一位作者的另一篇论文根本没有给出准确的描述,但它给出了 MSL 的运行时间为次线性,这完全是错误的。
回顾一下:是否有希望找到一个可行的参考实现,或者至少找到一个适用于 DNA 字符串的就地基数排序的良好伪代码/描述?
This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm?
Preliminary
I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.
At the moment, I use std::sort
which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.
Details
I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).
The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.
Use Cases
Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.
Research
Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.
In particular, there are the following things.
First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn’t detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group
and dest_address
without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address
isn’t trivial).
Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn’t belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).
Another paper by one of the authors gives no accurate description at all, but it gives MSL’s runtime as sub-linear which is flat out wrong.
To recap: Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(17)
如果您的数据集太大,那么我认为基于磁盘的缓冲区方法是最好的:
我还会尝试分组到更多数量的存储桶中,例如,如果您的字符串是:
第一个 MSB 调用将返回GATT 的存储桶(总共 256 个存储桶),这样您就可以减少基于磁盘的缓冲区的分支。 这可能会也可能不会提高性能,所以请尝试一下。
If your data set is so big, then I would think that a disk-based buffer approach would be best:
I would also experiment grouping into a larger number of buckets, for instance, if your string was:
the first MSB call would return the bucket for GATT (256 total buckets), that way you make fewer branches of the disk based buffer. This may or may not improve performance, so experiment with it.
我将冒险建议您切换到堆/heapsort 实现。 此建议带有一些假设:
堆/堆排序的优点在于,您可以在读取数据时构建堆,并且在构建堆后就可以开始获取结果。
让我们退后一步。 如果你很幸运,你可以异步读取数据(也就是说,你可以发出某种读取请求,并在某些数据准备好时收到通知),然后你可以在等待时构建一块堆。下一个要输入的数据块 - 甚至来自磁盘。 通常,这种方法可以将一半的排序成本隐藏在获取数据所花费的时间之后。
读取数据后,第一个元素就已经可用。 根据您发送数据的位置,这可能很棒。 如果您要将其发送到另一个异步读取器,或某些并行“事件”模型或 UI,则可以随时发送块和块。
也就是说,如果您无法控制数据的读取方式,并且数据是同步读取的,并且在排序后的数据完全写出之前您没有任何用处,请忽略所有这些。 :(
请参阅维基百科文章:
I'm going to go out on a limb and suggest you switch to a heap/heapsort implementation. This suggestion comes with some assumptions:
The beauty of the heap/heap-sort is that you can build the heap while you read the data, and you can start getting results the moment you have built the heap.
Let's step back. If you are so fortunate that you can read the data asynchronously (that is, you can post some kind of read request and be notified when some data is ready), and then you can build a chunk of the heap while you are waiting for the next chunk of data to come in - even from disk. Often, this approach can bury most of the cost of half of your sorting behind the time spent getting the data.
Once you have the data read, the first element is already available. Depending on where you are sending the data, this can be great. If you are sending it to another asynchronous reader, or some parallel 'event' model, or UI, you can send chunks and chunks as you go.
That said - if you have no control over how the data is read, and it is read synchronously, and you have no use for the sorted data until it is entirely written out - ignore all this. :(
See the Wikipedia articles:
“基数排序没有额外空间”是一篇解决您的问题的论文。
"Radix sorting with no extra space" is a paper addressing your problem.
在性能方面,您可能需要查看更通用的字符串比较排序算法。
目前,您最终会接触到每根弦的每个元素,但您可以做得更好!
特别是,突发排序非常适合这种情况。 作为一个额外的好处,由于burstsort是基于尝试的,所以它对于DNA/RNA中使用的小字母大小来说效果非常好,因为你不需要在结构中构建任何类型的三元搜索节点、散列或其他trie节点压缩方案。特里树的实现。 这些尝试也可能对您的后缀数组式最终目标有用。
一个不错的burstsort通用实现可以在source forge上找到 http://sourceforge.net/projects/burstsort/ - 但它没有就位。
出于比较目的,http 中介绍了 C-burstsort 实现: //www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf 基准测试比某些典型工作负载的快速排序和基数排序快 4-5 倍。
Performance-wise you might want to look at a more general string-comparison sorting algorithms.
Currently you wind up touching every element of every string, but you can do better!
In particular, a burst sort is a very good fit for this case. As a bonus, since burstsort is based on tries, it works ridiculously well for the small alphabet sizes used in DNA/RNA, since you don't need to build any sort of ternary search node, hash or other trie node compression scheme into the trie implementation. The tries may be useful for your suffix-array-like final goal as well.
A decent general purpose implementation of burstsort is available on source forge at http://sourceforge.net/projects/burstsort/ - but it is not in-place.
For comparison purposes, The C-burstsort implementation covered at http://www.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf benchmarks 4-5x faster than quicksort and radix sorts for some typical workloads.
您需要了解大规模基因组序列处理博士。 笠原和森下。
由四个核苷酸字母 A、C、G 和 T 组成的字符串可以专门编码为整数,以便更快地处理。 基数排序是本书讨论的众多算法之一; 您应该能够调整这个问题的公认答案并看到性能的巨大改进。
You'll want to take a look at Large-scale Genome Sequence Processing by Drs. Kasahara and Morishita.
Strings comprised of the four nucleotide letters A, C, G, and T can be specially encoded into Integers for much faster processing. Radix sort is among many algorithms discussed in the book; you should be able to adapt the accepted answer to this question and see a big performance improvement.
您可以尝试使用 trie。 对数据进行排序只是迭代数据集并插入它; 该结构是自然排序的,您可以将其视为类似于 B 树(除了不进行比较,您总是使用指针间接寻址)。
缓存行为将有利于所有内部节点,因此您可能不会对此进行改进; 但是您也可以调整 trie 的分支因子(确保每个节点都适合单个缓存行,分配类似于堆的 trie 节点,作为表示级别顺序遍历的连续数组)。 由于尝试也是数字结构(O(k) 插入/查找/删除长度为 k 的元素),因此您应该具有与基数排序竞争的性能。
You might try using a trie. Sorting the data is simply iterating through the dataset and inserting it; the structure is naturally sorted, and you can think of it as similar to a B-Tree (except instead of making comparisons, you always use pointer indirections).
Caching behavior will favor all of the internal nodes, so you probably won't improve upon that; but you can fiddle with the branching factor of your trie as well (ensure that every node fits into a single cache line, allocate trie nodes similar to a heap, as a contiguous array that represents a level-order traversal). Since tries are also digital structures (O(k) insert/find/delete for elements of length k), you should have competitive performance to a radix sort.
我会 burstsort 字符串的压缩位表示。 据称,突发排序比基数排序具有更好的局部性,用突发尝试代替经典尝试来降低额外的空间使用量。 原纸有尺寸。
I would burstsort a packed-bit representation of the strings. Burstsort is claimed to have much better locality than radix sorts, keeping the extra space usage down with burst tries in place of classical tries. The original paper has measurements.
看起来您已经解决了问题,但根据记录,可行的就地基数排序的一个版本似乎是“美国国旗排序”。 其描述如下:工程基数排序。 总体思路是对每个字符执行 2 次传递 - 首先计算每个字符有多少个,这样您就可以将输入数组细分为 bin。 然后再次检查,将每个元素交换到正确的容器中。 现在,对下一个字符位置上的每个容器进行递归排序。
It looks like you've solved the problem, but for the record, it appears that one version of a workable in-place radix sort is the "American Flag Sort". It's described here: Engineering Radix Sort. The general idea is to do 2 passes on each character - first count how many of each you have, so you can subdivide the input array into bins. Then go through again, swapping each element into the correct bin. Now recursively sort each bin on the next character position.
基数排序不具有缓存意识,也不是大型集合的最快排序算法。
您可以查看:
您还可以使用压缩并将 DNA 的每个字母编码为 2 位,然后再存储到排序数组中。
Radix-Sort is not cache conscious and is not the fastest sort algorithm for large sets.
You can look at:
You can also use compression and encode each letter of your DNA into 2 bits before storing into the sort array.
dsimcha 的 MSB 基数排序看起来不错,但 Nils 更接近问题的核心,因为他观察到缓存局部性是在处理大问题时导致你死亡的原因。
我建议采用一种非常简单的方法:
m
。m
个元素块,对它们进行基数排序,然后将它们写出(如果有足够的内存,则写入内存缓冲区,否则写入文件),直到耗尽输入。合并排序是我所知道的最适合缓存的排序算法:“从数组 A 或 B 中读取下一个项目,然后将一个项目写入输出缓冲区。” 它在磁带驱动器上高效运行。 它确实需要
2n
空间来对n
项进行排序,但我敢打赌,您将看到的大大改进的缓存局部性将使这一点变得不重要 - 如果您正在使用非就地基数排序,无论如何你都需要额外的空间。最后请注意,合并排序可以在没有递归的情况下实现,事实上,这样做可以清楚地表明真正的线性内存访问模式。
dsimcha's MSB radix sort looks nice, but Nils gets closer to the heart of the problem with the observation that cache locality is what's killing you at large problem sizes.
I suggest a very simple approach:
m
for which a radix sort is efficient.m
elements at a time, radix sort them, and write them out (to a memory buffer if you have enough memory, but otherwise to file), until you exhaust your input.Mergesort is the most cache-friendly sorting algorithm I'm aware of: "Read the next item from either array A or B, then write an item to the output buffer." It runs efficiently on tape drives. It does require
2n
space to sortn
items, but my bet is that the much-improved cache locality you'll see will make that unimportant -- and if you were using a non-in-place radix sort, you needed that extra space anyway.Please note finally that mergesort can be implemented without recursion, and in fact doing it this way makes clear the true linear memory access pattern.
首先,考虑问题的编码。 去掉字符串,用二进制表示替换它们。 使用第一个字节来指示长度+编码。 或者,在四字节边界使用固定长度表示。 那么基数排序就变得容易多了。 对于基数排序,最重要的是不要在内循环的热点处进行异常处理。
好吧,我对 4 进制问题想得更多一些。 您需要一个像 Judy 树 这样的解决方案。 下一个解决方案可以处理可变长度的字符串; 对于固定长度,只需删除长度位,这实际上使它更容易。
分配 16 个指针的块。 指针的最低有效位可以重复使用,因为您的块将始终对齐。 您可能需要一个特殊的存储分配器(将大存储分成更小的块)。 有多种不同类型的块:
对于每种类型的块,您需要在 LSB 中存储不同的信息。 由于您有可变长度的字符串,因此您也需要存储字符串结尾,并且最后一种块只能用于最长的字符串。 随着您对结构的深入了解,7 个长度位应该被替换为更少的长度位。
这为您提供了相当快速且非常内存高效的排序字符串存储。 它的行为有点像 trie。 为了使其正常工作,请确保构建足够的单元测试。 您希望覆盖所有块转换。 您只想从第二种块开始。
为了获得更高的性能,您可能需要添加不同的块类型和更大大小的块。 如果块始终大小相同且足够大,则可以使用更少的位作为指针。 对于 16 个指针的块大小,您在 32 位地址空间中已经有一个空闲字节。 查看 Judy 树文档以了解有趣的块类型。 基本上,您添加代码和工程时间以实现空间(和运行时)权衡。
您可能希望从前四个字符的 256 宽直接基数开始。 这提供了一个不错的空间/时间权衡。 在此实现中,与简单的 trie 相比,您获得的内存开销要少得多; 它大约小三倍(我没有测量)。 如果常数足够低,O(n) 就没有问题,正如您在与 O(n log n) 快速排序进行比较时注意到的那样。
您对处理双打感兴趣吗? 对于短序列,就会有。 调整块来处理计数很棘手,但它可以非常节省空间。
First, think about the coding of your problem. Get rid of the strings, replace them by a binary representation. Use the first byte to indicate length+encoding. Alternatively, use a fixed length representation at a four-byte boundary. Then the radix sort becomes much easier. For a radix sort, the most important thing is to not have exception handling at the hot spot of the inner loop.
OK, I thought a bit more about the 4-nary problem. You want a solution like a Judy tree for this. The next solution can handle variable length strings; for fixed length just remove the length bits, that actually makes it easier.
Allocate blocks of 16 pointers. The least significant bit of the pointers can be reused, as your blocks will always be aligned. You might want a special storage allocator for it (breaking up large storage into smaller blocks). There are a number of different kinds of blocks:
For each kind of block, you need to store different information in the LSBs. As you have variable length strings you need to store end-of-string too, and the last kind of block can only be used for the longest strings. The 7 length bits should be replaced by less as you get deeper into the structure.
This provides you with a reasonably fast and very memory efficient storage of sorted strings. It will behave somewhat like a trie. To get this working, make sure to build enough unit tests. You want coverage of all block transitions. You want to start with only the second kind of block.
For even more performance, you might want to add different block types and a larger size of block. If the blocks are always the same size and large enough, you can use even fewer bits for the pointers. With a block size of 16 pointers, you already have a byte free in a 32-bit address space. Take a look at the Judy tree documentation for interesting block types. Basically, you add code and engineering time for a space (and runtime) trade-off
You probably want to start with a 256 wide direct radix for the first four characters. That provides a decent space/time tradeoff. In this implementation, you get much less memory overhead than with a simple trie; it is approximately three times smaller (I haven't measured). O(n) is no problem if the constant is low enough, as you noticed when comparing with the O(n log n) quicksort.
Are you interested in handling doubles? With short sequences, there are going to be. Adapting the blocks to handle counts is tricky, but it can be very space-efficient.
虽然接受的答案完美地回答了问题的描述,但我已经到达这个地方,徒劳地寻找一种将内联数组划分为 N 个部分的算法。 我自己写了一篇,所以就在这里。
警告:这不是一种稳定的分区算法,因此对于多级分区,必须重新分区每个结果分区而不是整个数组。 优点是内联。
它有助于解决所提出的问题的方式是,您可以根据字符串的字母重复内联分区,然后在分区足够小时使用您选择的算法对分区进行排序。
While the accepted answer perfectly answers the description of the problem, I've reached this place looking in vain for an algorithm to partition inline an array into N parts. I've written one myself, so here it is.
Warning: this is not a stable partitioning algorithm, so for multilevel partitioning, one must repartition each resulting partition instead of the whole array. The advantage is that it is inline.
The way it helps with the question posed is that you can repeatedly partition inline based on a letter of the string, then sort the partitions when they are small enough with the algorithm of your choice.
只是在这里发布我的回复给那些仍在寻找此类问题的人(并且接受的答案可能无法解决问题) -
DNA 序列似乎出现在对。 我猜“A”与“T”配对,反之亦然,“G”与“C”配对,反之亦然。 这在一定程度上减少了这个问题。DNA 对在这里不适用。 这也从下面评论中的 OP 得到了证实。 考虑到给定字符串中一次针对四个字符的排序问题(因此字符串长度将是 4 的倍数)。鉴于上述情况,有效排列将是 'AT CG'、'AT GC'、'TA CG'、'TA GC'、'CG AT'、'CG TA'、'G AT'、 “GC TA”。 这使得总共有 8 种组合 - 假设我们按 AT/TA/CG/GC 对进行排序。给定四个字符排列,排列总数为 256(AAAA、AAAT、AAAG、AAAC...等等)!!这意味着我们只需要八个桶/箱/计数器。鉴于 256 种可能性,我们需要 256 个计数器。8256 种可能性之一的相应计数器。基本上,我们将问题陈述简化为一种计数排序(这种方法也是上面的答案之一,但在这种情况下,我只是将其称为计数排序,此外还尝试利用序列总是发生的事实成对 - 这有助于更好地管理问题陈述)
注意:还实现了就地基数计数排序混合,我相信它相当于上面提供的链接。
Just posting my response here for those who are still looking for this kind of problems (and where the accepted answer may not do the trick) -
DNA sequences appear to occur in pairs. 'A' pairs with 'T' or vice-versa I guess and 'G' pairs with 'C' or vice-versa. This reduces the problem to some extent.DNA pair does not apply here. This is confirmed from OP in the comments below as well. Considering the sorting problem to be for four characters at a time in a given string (the string length would therefore be a multiple of 4).Given the above, valid permutation would therefore be 'AT CG', 'AT GC', 'TA CG', 'TA GC', 'CG AT', 'CG TA', 'GC AT', 'GC TA'. That makes it 8 combinations over all - assuming we sort in AT/TA/CG/GC pairs.Given four character permutation, the total number of permutations is 256 (AAAA, AAAT, AAAG, AAAC... and so on) !!This means that we need only eight buckets / bins / counters.Given 256 possibilities, we'd need 256 counters.8256 possibilities.Basically we are reducing the problem statement to a kind of counting sort (this approach is also one of the answers above, but in this case I am simply calling it out as counting-sort besides also trying to leverage the fact that the sequences always occur in pairs - which helps manage the problem statement better)
Note: Have also implemented an in-place radix-counting-sort hybrid which I believe is equivalent to the link provided above.
既然字母这么少,为什么不数一下呢?
when there are so little letters, why not just count them?
嗯,这是 DNA 的 MSD 基数排序的简单实现。 它是用 D 语言编写的,因为这是我使用最多的语言,因此最不可能犯愚蠢的错误,但它可以很容易地翻译成其他语言。 它是就地的,但需要
2 * seq.length
穿过数组。显然,这是 DNA 特有的,而不是通用的,但它应该很快。
编辑:
我很好奇这段代码是否真的有效,所以我在等待我自己的生物信息学代码运行时测试/调试了它。 上面的版本现在已经经过实际测试并且可以工作。 对于 1000 万个每个 5 个碱基的序列,它比优化的内排序快大约 3 倍。
Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires
2 * seq.length
passes through the array.Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.
Edit:
I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.
我从未见过就地基数排序,并且从基数排序的性质来看,只要临时数组适合内存,我怀疑它比异地排序要快得多。
原因:
排序对输入数组进行线性读取,但所有写入几乎都是随机的。 从某个 N 向上,这可以归结为每次写入的缓存未命中。 这种缓存未命中会减慢算法的速度。 无论是否到位都不会改变此效果。
我知道这不会直接回答您的问题,但如果排序是一个瓶颈,您可能需要看看近排序算法作为预处理步骤(维基-软堆上的页面可能会帮助您入门)。
这可以提供非常好的缓存局部性提升。 教科书上的异位基数排序将表现得更好。 写入仍然几乎是随机的,但至少它们会聚集在相同的内存块周围,因此提高了缓存命中率。
我不知道它在实践中是否有效。
顺便说一句:如果您只处理 DNA 字符串:您可以将一个字符压缩为两位并打包大量数据。 与简单的表示相比,这会将内存需求减少四倍。 寻址变得更加复杂,但无论如何,CPU 的 ALU 在所有缓存未命中期间都会花费大量时间。
I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.
Reason:
The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.
I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).
That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.
I have no idea if it works out in practice though.
Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.
您当然可以通过按位对序列进行编码来降低内存需求。
您正在查看排列,因此对于长度为 2 的“ACGT”,它有 16 个状态或 4 位。
对于长度 3,即 64 个状态,可以用 6 位进行编码。 所以看起来序列中的每个字母都有 2 位,或者像你所说的那样,16 个字符大约有 32 位。
如果有办法减少有效“单词”的数量,则可能可以进一步压缩。
因此,对于长度为 3 的序列,可以创建 64 个存储桶,大小可能为 uint32 或 uint64。
将它们初始化为零。
迭代非常非常大的 3 个字符序列列表,并按上述方式对它们进行编码。
使用它作为下标,并增加该存储桶。
重复此操作,直到所有序列均已处理完毕。
接下来,重新生成您的列表。
按顺序迭代 64 个存储桶,对于在该存储桶中找到的计数,生成由该存储桶表示的序列的多个实例。
当所有的存储桶都被迭代后,你就得到了排序的数组。
4 的序列添加 2 位,因此将有 256 个桶。
5 的序列添加 2 位,因此将有 1024 个桶。
在某些时候,桶的数量将接近您的极限。
如果您从文件中读取序列,而不是将它们保存在内存中,则可以为存储桶提供更多内存。
我认为这比就地排序更快,因为存储桶可能适合您的工作集。
这是展示该技术的 hack
You can certainly drop the memory requirements by encoding the sequence in bits.
You are looking at permutations so, for length 2, with "ACGT" that's 16 states, or 4 bits.
For length 3, that's 64 states, which can be encoded in 6 bits. So it looks like 2 bits for each letter in the sequence, or about 32 bits for 16 characters like you said.
If there is a way to reduce the number of valid 'words', further compression may be possible.
So for sequences of length 3, one could create 64 buckets, maybe sized uint32, or uint64.
Initialize them to zero.
Iterate through your very very large list of 3 char sequences, and encode them as above.
Use this as a subscript, and increment that bucket.
Repeat this until all of your sequences have been processed.
Next, regenerate your list.
Iterate through the 64 buckets in order, for the count found in that bucket, generate that many instances of the sequence represented by that bucket.
when all of the buckets have been iterated, you have your sorted array.
A sequence of 4, adds 2 bits, so there would be 256 buckets.
A sequence of 5, adds 2 bits, so there would be 1024 buckets.
At some point the number of buckets will approach your limits.
If you read the sequences from a file, instead of keeping them in memory, more memory would be available for buckets.
I think this would be faster than doing the sort in situ as the buckets are likely to fit within your working set.
Here is a hack that shows the technique