Sorting 1 billion integers with limited physical memory
I want to sort 1 billion integers, and my system has just 1 GB of RAM. What would be the fastest and most efficient way to sort them?
Say the input is a text file with one integer per line.
We are using a Java program to do the sort.
I mention the RAM because we cannot hold all the input integers in memory at once.
Update: The integers are 7-digit numbers.
So there are only 10 million possible values.
You have 1 GB of RAM. Make an array of counters, one for each possible value.
Read through the file once, incrementing the counter for each number you see.
When done, output the numbers according to the final counter values.
Each number can occur at most 1 billion times, so a 32-bit counter is enough. That means an array of 10M x 4 bytes = 40 MB.
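As a rough illustration of the counting approach just described, here is a minimal sketch. The class name `CountingSortSketch` and the `RANGE` constant are made up for this example, and it assumes every line holds a non-negative integer below 10,000,000. Note that a value written with leading zeros would be printed without them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

class CountingSortSketch {
    // Assumption: every value is a non-negative integer below 10,000,000.
    static final int RANGE = 10_000_000;

    static void sort(Reader in, Writer out) throws IOException {
        int[] counts = new int[RANGE]; // 10M x 4 bytes, about 40 MB

        // Pass 1: read the input once and count each value.
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            counts[Integer.parseInt(line)]++;
        }

        // Pass 2: walk the counters in ascending order and
        // write each value as many times as it was seen.
        BufferedWriter writer = new BufferedWriter(out);
        for (int value = 0; value < RANGE; value++) {
            for (int i = 0; i < counts[value]; i++) {
                writer.write(Integer.toString(value));
                writer.write('\n');
            }
        }
        writer.flush();
    }
}
```

For the real input you would presumably pass a `FileReader` and a `FileWriter`; the sort itself needs only one sequential pass over the file plus the 40 MB array.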
The simplest thing to do is to break the input into smaller files that fit in memory, sort each one, and then merge the results.
Guido van Rossum has a good description of doing this in Python; while it is obviously not the same language, the principle is the same.
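A Java version of the split, sort, and merge idea might look roughly like the sketch below. It is not a port of the Python description mentioned above. The names `ExternalSortSketch`, `Run`, `writeRun`, and `chunkSize` are invented for illustration, and the temp-file handling is deliberately simple.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

class ExternalSortSketch {
    // One sorted temp file plus the next value not yet written to the output.
    static final class Run {
        final BufferedReader reader;
        int head;
        Run(BufferedReader reader) { this.reader = reader; }
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            head = Integer.parseInt(line);
            return true;
        }
    }

    static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        // Phase 1: read chunkSize integers at a time, sort them in memory,
        // and write each sorted chunk to its own temp file.
        try (BufferedReader in = Files.newBufferedReader(input)) {
            int[] buffer = new int[chunkSize];
            int n = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                buffer[n++] = Integer.parseInt(line);
                if (n == chunkSize) { runs.add(writeRun(buffer, n)); n = 0; }
            }
            if (n > 0) runs.add(writeRun(buffer, n));
        }
        // Phase 2: N-way merge, using a min-heap keyed on each run's head.
        PriorityQueue<Run> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path p : runs) {
                Run r = new Run(Files.newBufferedReader(p));
                if (r.advance()) heap.add(r); else r.reader.close();
            }
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                out.write(Integer.toString(r.head));
                out.write('\n');
                if (r.advance()) heap.add(r); else r.reader.close();
            }
        } finally {
            for (Run r : heap) r.reader.close(); // only non-empty after a failure
            for (Path p : runs) Files.deleteIfExists(p);
        }
    }

    static Path writeRun(int[] buffer, int n) throws IOException {
        Arrays.sort(buffer, 0, n);
        Path run = Files.createTempFile("run", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(run)) {
            for (int i = 0; i < n; i++) {
                w.write(Integer.toString(buffer[i]));
                w.write('\n');
            }
        }
        return run;
    }
}
```

With 1 GB of RAM, a `chunkSize` of around 100 million integers (about 400 MB as an `int[]`) would give roughly ten runs for a billion inputs, though the right value depends on the JVM heap settings.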
You specified that you are sorting a billion 7-digit (decimal) numbers.
If there were no duplicates, you could sort in memory with 10^7 bits using radix sort. Since you must have duplicates (10^7 is less than 10^9), you could implement radix sort using (say) an array of 10^7 8-bit counters, with a HashMap<Integer, Integer> to deal with the relatively few cases where the counters overflow. Or just use an array of 10^7 32-bit counters.
Another more general approach (that works for any kind of value) is to split the file into N smaller subfiles, sort each subfile in memory, and then perform an N-way merge of the sorted subfiles.
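The 8-bit counter variant could be sketched as follows. The class and method names (`ByteCounterSketch`, `add`, `count`, `emitSorted`) are hypothetical, and the sketch assumes values in the range 0 to 9,999,999. How often the overflow map is actually hit depends on how skewed the input is.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntConsumer;

class ByteCounterSketch {
    // Assumption: values are non-negative and below 10,000,000.
    static final int RANGE = 10_000_000;

    private final byte[] small = new byte[RANGE];              // about 10 MB
    private final Map<Integer, Integer> overflow = new HashMap<>();

    void add(int value) {
        int current = small[value] & 0xFF; // treat the byte as unsigned
        if (current < 255) {
            small[value] = (byte) (current + 1);
        } else {
            // The byte is saturated at 255: record further hits in the map.
            overflow.merge(value, 1, Integer::sum);
        }
    }

    int count(int value) {
        return (small[value] & 0xFF) + overflow.getOrDefault(value, 0);
    }

    // Feeds every value to the sink in ascending order, repeated by its count.
    void emitSorted(IntConsumer sink) {
        for (int value = 0; value < RANGE; value++) {
            int n = count(value);
            for (int i = 0; i < n; i++) {
                sink.accept(value);
            }
        }
    }
}
```

This trades the 40 MB `int[]` for about 10 MB plus the map, which only pays off if few values occur more than 255 times. With 1 GB available, the plain 32-bit array is the simpler choice.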
A BitSet covering all 4 billion possible int values occupies 512 MB. Just set the bit for every int value you see and write them out in order (they come out naturally sorted). This only works if you don't care about duplicates.
If counting duplicates matters, I would still consider either a memory-mapped file for counting or a merge sort of sorted subsections of the data. (I believe the latter is the expected answer.)
I recently bought a 24 GB PC for under £1K, so a few GB isn't that much unless you are limited by a hosted solution (or are using a mobile device).
Assuming every integer occurs exactly once, you can read the file and set a bit for every number you find. The bit array has to hold 10,000,000 bits, which uses only about 1.25 MB of RAM and should be available. After you have read all the integers, you just go through the array and output the numbers whose bit is set.
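A minimal sketch of that bit-array idea, using `java.util.BitSet`, is shown below. The class name `BitSetSortSketch` is made up, and the sketch assumes non-negative values below 10,000,000. If a value does repeat, it is silently written only once.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.BitSet;

class BitSetSortSketch {
    // Assumption: values are non-negative and below 10,000,000.
    static final int RANGE = 10_000_000;

    static void sort(Reader in, Writer out) throws IOException {
        BitSet seen = new BitSet(RANGE); // 10,000,000 bits, about 1.25 MB

        // Pass 1: set one bit per value found in the input.
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            seen.set(Integer.parseInt(line));
        }

        // Pass 2: visit the set bits in ascending order.
        BufferedWriter writer = new BufferedWriter(out);
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            writer.write(Integer.toString(v));
            writer.write('\n');
        }
        writer.flush();
    }
}
```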