对大于 RAM 大小的数据进行排序
这是谷歌面试问题: 给定 2 台机器,每台机器都有 64 GB RAM,包含所有整数(8 字节),对整个 128 GB 数据进行排序。您可以假设有少量额外的 RAM。扩展此功能以对存储在 1000 台机器中的数据进行排序。
我想出了外部排序。我们将整个数据分成块并对它们使用合并排序。这是首先对块进行排序,然后将它们放回去,然后再次分段并合并它们。有更好的办法吗?复杂程度如何?
This is a Google interview question:
Given 2 machines, each having 64 GB RAM, containing all integers (8 byte), sort the entire 128 GB data. You may assume a small amount of additional RAM. Extend this to sort data stored in 1000 machines.
I came up with external sort. In that we divide the entire data into chunks and use merge sort on them. That is the first sort the chunks and put them back and get them again piece wise and merge them. Is there a better way? What would be the complexity?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
ChingPing 提出对每个子集进行 O(n log n) 排序,然后进行线性合并(通过交换元素)。快速排序的问题(以及大多数 n log n 排序,是它们需要 n 内存。我建议使用 SmoothSort 使用常量内存,仍然以 O(n log n) 运行。
最坏的情况是这样的:
两个集合都按相反顺序排序,但合并却相反订单。
(国际海事组织 -更清楚)ChingPing 解决方案的解释是:
现在应该对集合进行排序。
ChingPing proposes a O(n log n) sort for each subset, followed by a linear merge (by swapping the elements). The problem with Quicksort (and most of the n log n sorts, is that they require n memory. I'd recommend instead using a SmoothSort which uses constant memory, still runs in O(n log n).
The worst case scenario is where you have something like:
where both sets are ordered in reverse, but then the merger is in the reverse order.
The (IMO - more clear) explanation of ChingPing's solution is:
The sets should both now be sorted.
每个 64 GB 都可以单独使用快速排序进行排序,然后使用外部存储器将指针保留在两个 64 GB 数组的头部,让我们考虑一下我们希望 RAM1 和 RAM2 按此顺序拥有整个数据,如果如果它小于 RAM2 中的指针值,否则将值与 RAM2 交换,直到指针到达 RAM1 的末尾。
采用同样的概念对所有 N 个 RAM 进行排序。取它们对并使用上述方法进行排序。剩下 N/2 个已排序的 RAM。递归地使用上面相同的概念。
Each of the 64 GB can be sorted using a quicksort separately and then using the external memory keep pointers at the heads of both 64GB array, lets consider we want RAM1 and RAM2 in that order to have the entire data, keep incrementing pointer at RAM1 if its smaller then the pointer value at RAM2 else swap the value with RAM2 until the pointer reached end of RAM1.
take the same concept to sort all N RAMs. Take pairs of them and sort using above method. You are left with N/2 sorted RAMs. Use the same concept above recursively.
2机壳已经有答案了。
我假设要排序的 128GB 数据作为单个文件存储在单个硬盘驱动器(或任何外部设备)上。无论使用多少台机器或硬盘,读取原始128GB文件和写入排序后的128GB文件所需的时间保持不变。唯一的节省发生在基于内部 RAM 的排序过程中,以创建排序数据块。与 n+1 硬盘驱动器合并以再次将单个排序的 128GB 文件合并到剩余硬盘驱动器上所需的时间保持不变,受限于将 128GB 排序文件写入剩余硬盘驱动器所需的时间硬盘。
对于 n 台机器,数据将被分割成 128GB/n 的块。每台机器都可以交替读取子块(一次可能是 64MB),以减少随机访问开销,这样“最后”一台机器就不会在启动之前等待所有先前的机器读取其所有块。
对于 n 台机器(每台 64GB 内存)和 n+1 个硬盘驱动器,其中 n >= 4,每台机器可以使用时间复杂度为 O(n) 的基数排序在 n 个工作硬盘驱动器上创建 32GB 或更小的块。同时,然后进行 n 路合并到目标硬盘上。
存在一个收益递减点限制了较大 n 的好处。超出 n > 的某个地方16、内部合并吞吐量可能变得大于磁盘 I/O 带宽。如果合并过程受 cpu 限制而不是 I/O 限制,则需要在 cpu 开销(并行创建块所需的时间)与大于 I/O 时间的合并开销之间进行权衡。
There are already answers for the 2 machine case.
I'm assuming that the 128GB of data to be sorted is stored as a single file on a single hard drive (or any external device). No matter how many machines or hard drives are used, the time it takes to read the original 128GB file and write the sorted 128GB file remains the same. The only savings occurs during the internal ram based sorts to create chunks of sorted data. The time it takes to merge with n+1 hard drives to do a n-way merge into a single sorted 128GB file onto the remaining hard drive again remains the same, limited by the time it takes to write the 128GB sorted file onto that remaining hard drive.
For n machines, the data would be split up into 128GB/n chunks. Each of the machines could alternate reading sub-chunks, perhaps 64MB at a time, to reduce random access overhead, so that the "last" machine isn't waiting for all of the prior machines to read all of their chunks before it even starts.
For n machines (64GB ram each) and n+1 hard drives with n >= 4, a radix sort with O(n) time complexity could be used by each machine to create 32GB or smaller chunks on the n working hard drives at the same time, followed by a n-way merge onto the destination hard drive.
There's a point of diminishing returns limiting the benefit of larger n. Somewhere beyond n > 16, the internal merge throughput could become greater than disk I/O bandwidth. If the merge process is cpu bound rather than I/O bound, there's a trade off in cpu overhead for the time it take to create chunks in parallel versus the merge overhead greater than I/O time.