Computing the median of a billion numbers


If you have one billion numbers and one hundred computers, what is the best way to locate the median of these numbers?

One solution which I have is:

  • Split the set equally among the computers.
  • Sort them.
  • Find the medians for each set.
  • Sort the sets on medians.
  • Merge two sets at a time from the lowest to the highest median.

If we have m1 < m2 < m3 ... then first merge Set1 and Set2 and in the resulting set we can discard all the numbers lower than the median of Set12 (merged). So at any point of time we have equal sized sets. By the way this cannot be done in a parallel manner. Any ideas?

Comments (25)

梦里的微风 2024-09-03 05:38:44

Ah, my brain has just kicked into gear, I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.

Each other machine sorts its data, and does so in a way which favours finding the lower values first. So for example a quicksort, always sorting the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle on: experiment a bit).

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 1/2 billionth and 1/2 billion plus oneth values.
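
A minimal sketch of that counting merge, with in-memory iterators standing in for the sorted streams arriving over the network (the class and names are illustrative, not code from this answer):

import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Pop values in global sorted order, discard them, and stop as soon as
// the two middle ranks have gone past.
class CountingMerge {
    static double medianOf(List<Iterator<Long>> streams, long n) {
        // heap entries: {next value, index of the stream it came from}
        PriorityQueue<long[]> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int i = 0; i < streams.size(); i++) {
            if (streams.get(i).hasNext()) {
                heap.add(new long[]{streams.get(i).next(), i});
            }
        }
        long seen = 0, lower = 0;
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            seen++;
            if (seen == (n + 1) / 2) lower = top[0];    // lower middle value
            if (seen == n / 2 + 1) {
                return (lower + top[0]) / 2.0;          // mean of the two middles
            }
            Iterator<Long> it = streams.get((int) top[1]);
            if (it.hasNext()) heap.add(new long[]{it.next(), top[1]});
        }
        throw new IllegalArgumentException("n exceeds the stream lengths");
    }
}

For odd n the two ranks coincide and the average degenerates to the single middle value; as described above, the loop never consumes more than the first half of the data.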

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bound, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1,2,3,.. 100", perhaps a sorting machine could send a message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.

¢蛋碎的人ぎ生 2024-09-03 05:38:44
sort -g numbers | head -n 500000001 | tail -n 2 | dc -e "1 k ? ? + 2 / p"
(Sort numerically, keep the first 500,000,001 lines so that the last two are the 500,000,000th and 500,000,001st values, then let dc average them.)
十六岁半 2024-09-03 05:38:44

I hate to be the contrarian here, but I don't believe sorting is required, and I think any algorithm involving sorting a billion/100 numbers is going to be slow. Let's consider an algorithm on one computer.

1) Select 1000 values at random from the billion, and use them to get an idea of the distribution of the numbers, especially a range.

2) Instead of sorting the values, allocate them to buckets based on the distribution you just calculated. The number of buckets is chosen so that the computer can handle them efficiently, but should otherwise be as large as convenient. The bucket ranges should be so that approximately equal numbers of values go in each bucket (this isn't critical to the algorithm, but it helps efficiency. 100,000 buckets might be appropriate). Note the number of values in each bucket. This is an O(n) process.

3) Find out which bucket range the median lies in. This can be done by simply examining the total numbers in each bucket.

4) Find the actual median by examining the values in that bucket. You can use a sort here if you like, since you are only sorting maybe 10,000 numbers. If the number of values in that bucket is large then you can use this algorithm again until you have a small enough number to sort.

This approach parallelizes trivially by dividing the values between the computers. Each computer reports the totals in each bucket to a 'control' computer which does step 3. For step 4 each computer sends the (sorted) values in the relevant bucket to the control computer (you can do both of those algorithms in parallel too, but it probably isn't worth it).

The total process is O(n), since both steps 3 and 4 are trivial, provided the number of buckets is large enough.
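
A single-machine sketch of steps 2-4 (equal-width buckets and the lower-median convention are simplifications of mine; the sampling step above would instead pick boundaries that fill buckets roughly evenly):

import java.util.Arrays;

// Assumes (hi - lo + 1) times B fits in a long, e.g. values in the
// 32-bit range.
class BucketMedian {
    static final int B = 100_000;                        // number of buckets

    static int bucketOf(long x, long lo, long hi) {
        return (int) ((x - lo) * B / (hi - lo + 1));
    }

    static long median(long[] data, long lo, long hi) {
        long[] counts = new long[B];
        for (long x : data) counts[bucketOf(x, lo, hi)]++;   // step 2: O(n) pass

        long target = (data.length - 1) / 2;      // rank of the lower median
        long before = 0;
        int b = 0;
        while (before + counts[b] <= target) before += counts[b++];   // step 3

        final int chosen = b;                     // bucket holding the median
        long[] small = Arrays.stream(data)
                .filter(x -> bucketOf(x, lo, hi) == chosen).toArray();
        Arrays.sort(small);                       // step 4: sort only ~n/B values
        return small[(int) (target - before)];
    }
}

For the distributed version, each machine computes its own counts array and the control computer sums them before steps 3 and 4, as the answer describes.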

伊面 2024-09-03 05:38:44

The estimation of order statistics like median and 99th percentile can be efficiently distributed with algorithms like t-digest or Q-digest.

Using either algorithm, each node produces a digest, which represents the distribution of the values stored locally. The digests are collected at a single node, merged (effectively summing the distributions), and the median or any other percentile can then be looked up.

This approach is used by elasticsearch and, presumably, BigQuery (going by the description of the QUANTILES function).
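
A sketch of that distribute/summarize/merge/query shape, written against a hypothetical Digest interface (a stand-in, not the actual t-digest or Q-digest API, which differs between libraries):

import java.util.function.Supplier;

// Hypothetical interface - not a real library's API.
interface Digest {
    void add(double x);          // absorb one locally stored value
    void merge(Digest other);    // effectively sum two distributions
    double quantile(double q);   // 0.5 for the median, 0.99 for the 99th percentile
}

class DigestMedian {
    static double medianAcrossNodes(double[][] perNode, Supplier<Digest> factory) {
        Digest merged = factory.get();
        for (double[] node : perNode) {     // conceptually, each node runs its own loop
            Digest local = factory.get();
            for (double x : node) local.add(x);
            merged.merge(local);            // only the small digest crosses the network
        }
        return merged.quantile(0.5);
    }
}

Only digests travel between nodes, so the communication cost is independent of N; the price is that the result is an approximation whose error is governed by the digest's compression parameter.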

世俗缘 2024-09-03 05:38:44

One billion is actually quite a boring task for a modern computer. We're talking about 4 GB worth of 4 byte integers here ... 4 GB ... that's the RAM of some smartphones.

import java.text.DecimalFormat;
import java.util.Arrays;
import java.util.Random;

public class Median {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();

        int[] numbers = new int[1_000_000_000];

        System.out.println("created array after " +  (System.currentTimeMillis() - start) + " ms");

        Random rand = new Random();
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = rand.nextInt();
        }

        System.out.println("initialized array after " + (System.currentTimeMillis() - start) + " ms");

        Arrays.sort(numbers);

        System.out.println("sorted array after " + (System.currentTimeMillis() - start) + " ms");

        if (numbers.length % 2 == 1) {
            // odd count: the single middle element
            System.out.println("median = " + numbers[numbers.length / 2]);
        } else {
            // even count: mean of the two middle elements (long cast avoids overflow)
            int m1 = numbers[numbers.length / 2 - 1];
            int m2 = numbers[numbers.length / 2];
            double m = ((long) m1 + m2) / 2.0;
            System.out.println("median = " + new DecimalFormat("#.#").format(m));
        }
    }
}

Output on my machine:

created array after 518 ms
initialized array after 10177 ms
sorted array after 102936 ms
median = 19196

So this completes on my machine within less than two minutes (1:43 of which 0:10 are to generate random numbers) using a single core and it's even doing a full sort. Nothing fancy really.

This surely is an interesting task for larger sets of numbers. I just want to make a point here: one billion is peanuts. So think twice before you start throwing complex solutions at surprisingly simple tasks ;)

苏璃陌 2024-09-03 05:38:44

The median for this set of numbers

2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89, 97

is 67.

The median for this set of numbers

2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89

is 40.

Assuming the question was about 1,000,000,000 integers(x) where 0 <= x <= 2,147,483,647 and that the OP was looking for (element(499,999,999) + element(500,000,000)) / 2 (if the numbers were sorted). Also assuming that all 100 computers were all equal.

Using my laptop and GigE...

What I found was that my laptop can sort 10,000,000 Int32's in 1.3 seconds. So a rough estimate would be that a billion number sort would take 100 x 1.3 seconds(2 minutes 10 seconds) ;).

An estimate of a one-way file transfer of a 40MB file on a gigabit Ethernet is .32 seconds. This means that the sorted results from all computers will be returned in approximately 32 seconds(computer 99 didn't get his file until 30 seconds after the start). From there it shouldn't take long to discard the lowest 499,999,998 numbers, add the next 2 and divide by 2.

苍景流年 2024-09-03 05:38:44

This might surprise people, but if the numbers are integers small enough to fit inside 32-bit (or smaller) - Just do a bucket sort! Only needs 16GB of ram for any number of 32-bit ints and runs in O(n), which should outperform any distributed systems for reasonable n, e.g. a billion.

Once you have the sorted list, it's trivial to pick out the median. In fact, you do not need to construct the sorted list, but only looking at the buckets should do it.

A simple implementation is shown below. Only works for 16-bit integers, but extension to 32-bit should be easy.

#include <stdio.h>
#include <string.h>

int main()
{
    /* one counter per 16-bit value; unsigned int rather than unsigned short,
       since a single value repeated more than 65535 times would overflow */
    static unsigned int buckets[65536];
    int input, n=0, count=0, i=0;

    // calculate buckets
    memset(buckets, 0, sizeof(buckets));
    while (scanf("%d", &input) != EOF)
    {
        buckets[input & 0xffff]++;
        n++;
    }

    // find median: walk the buckets until the cumulative count passes n/2
    while (count <= n/2)
    {
        count += buckets[i++];
    }

    printf("median: %d\n", i-1);

    return 0;
}

Using a text file with a billion (10^9) numbers and running with time like so

time ./median < billion

yields a running time on my machine of 1m49.293s. Most of the running time is probably disk IO as well.

永不分离 2024-09-03 05:38:44

Oddly enough, I think if you have enough computers, you're better off sorting than using O(n) median-finding algorithms. (Unless your cores are very, very slow, though, I'd just use one and use an O(n) median-finding algorithm for merely 1e9 numbers; if you had 1e12, though, that might be less practical.)

Anyway, let's suppose we have more than log n cores to deal with this problem, and we don't care about power consumption, just getting the answer fast. Let's further assume that this is a SMP machine with all the data already loaded in memory. (Sun's 32-core machines are of this type, for instance.)

One thread chops the list up blindly into equal sized pieces and tells the other M threads to sort them. Those threads diligently do so, in (n/M) log (n/M) time. They then return not only their medians, but, say, their 25th and 75th percentiles as well (perverse worst cases are better if you choose slightly different numbers). Now you have 4M ranges of data. You then sort these ranges and work upwards through the list until you find a number such that, if you throw out every range that is smaller than or contains the number, you will have thrown out half your data. That's your lower bound for the median. Do the same for the upper bound. This takes something like M log M time, and all cores have to wait for it, so it's really wasting M^2 log M potential time. Now you have your single thread tell the others to toss all data outside the range (you should throw out about half on each pass) and repeat--this is a trivially fast operation since the data is already sorted. You shouldn't have to repeat this more than log(n/M) times before it's faster to just grab the remaining data and use a standard O(n) median finder on it.

So, total complexity is something like O((n/M) log (n/M) + M^2 log M log (n/M)). Thus, this is faster than O(n) median sort on one core if M >> log(n/M) and M^3 log M < n, which is true for the scenario you've described.

I think this is a really bad idea given how inefficient it is, but it is faster.

删除→记忆 2024-09-03 05:38:44

This can be done faster than the top-voted O(n log n) algorithm:

- Order-statistics distributed selection algorithm - O(n). Reduce the problem to the original problem of finding the kth number in an unsorted array.
- Counting-sort histogram - O(n). You have to assume some properties about the range of the numbers - can the range fit in memory?
- External merge sort - O(n log n) - described above. You basically sort the numbers on the first pass, then find the median on the second.
- If anything is known about the distribution of the numbers, other algorithms can be produced.

For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html

夜巴黎 2024-09-03 05:38:44

One computer is more than enough to solve the problem.

But let's assume that there are 100 computers. The only complex thing you should do is to sort the list. Split it to 100 parts, send one part to each computer, let them be sorted there, and merge parts after that.

Then take the number from the middle of the sorted list (i.e. with index 500 000 000).

久随 2024-09-03 05:38:44

It depends on your data. The worst case scenario is that it's uniformly distributed numbers.

In this case you can find the median in O(N) time like in this example:

Suppose your numbers are 2,7,5,10,1,6,4,4,6,10,4,7,1,8,4,9,9,3,4,3 (range is 1-10).

We create 3 buckets: 1-3, 4-7, 8-10. Note that top and bottom have equal size.

We fill the buckets with the numbers, count how many fall in each, the max and the min

  • low (5): 2,1,1,3,3, min 1, max 3
  • middle (10): 7,5,6,4,4,6,4,7,4,4, min 4, max 7
  • high (5): 10, 10, 8, 9, 9, min 8, max 10

The median falls in the middle bucket; we disregard the rest

We create 3 buckets: 4, 5-6, 7. Low will start with a count of 5 and with a max of 3 and high with a min of 8 and a count of 5.

For each number we count how many fall in the low and high bucket, the max and the min, and keep the middle bucket.

  • old low (5)
  • low (5): 4, 4, 4, 4, 4, max 4
  • middle (3): 5,6,6
  • high (2): 7, 7, min 7
  • old high (5)

Now we can calculate the median directly: we have a situation like this

old low    low          middle  high  old high
x x x x x  4 4 4 4 4   5 6 6  7 7   x x x x x

so the median is 4.5.

Assuming you know a little about the distribution, you can fine tune how to define the ranges to optimize speed. In any case, the performance should go with O(N), because 1 + 1/3 + 1/9... = 1.5

You need min and max because of edge cases (e.g. if the median is the average between the max of old low and the next element).

All of these operations can be parallelized, you can give 1/100 of the data to each computer and calculate the 3 buckets in each node, then distribute the bucket you keep. This again makes you use the network efficiently because each number is passed on average 1.5 times (so O(N)). You can even beat that if you only pass the minimal numbers among nodes (e.g. if node 1 has 100 numbers and node 2 has 150 numbers, then node 2 can give 25 numbers to node 1).

Unless you know more about the distribution, I doubt you can do better than O(N) here, because you actually need to count the elements at least once.

冷月断魂刀 2024-09-03 05:38:44

An easier method is to have weighted numbers (a sketch of the merge step follows the list).

  • Split the large set among computers
  • Sort each set
  • iterate through the small-set, and calculate weights to repeated elements
  • merge each 2 sets into 1 (each is sorted already) updating weights
  • keep merging sets until you get only one set
  • iterate through this set accumulating weights until you reach OneBillion/2
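
A sketch of the merge step under illustrative types (not code from the answer): each set is a sorted run of (value, weight) pairs, where the weight counts repeats, and the final scan walks the single merged run to the halfway weight.

import java.util.ArrayList;
import java.util.List;

class WeightedMerge {
    record Entry(long value, long weight) {}

    // merge two sorted weighted runs, combining weights of equal values
    static List<Entry> merge(List<Entry> a, List<Entry> b) {
        List<Entry> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            Entry next;
            if (j == b.size() || (i < a.size() && a.get(i).value() <= b.get(j).value())) {
                next = a.get(i++);
            } else {
                next = b.get(j++);
            }
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).value() == next.value()) {
                out.set(last, new Entry(next.value(), out.get(last).weight() + next.weight()));
            } else {
                out.add(next);
            }
        }
        return out;
    }

    // last step: accumulate weights until OneBillion/2 (lower median, for simplicity)
    static long medianOf(List<Entry> run, long total) {
        long seen = 0;
        for (Entry e : run) {
            seen += e.weight();
            if (seen > (total - 1) / 2) return e.value();
        }
        throw new IllegalStateException("total exceeds the run's weight");
    }
}
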
青衫儰鉨ミ守葔 2024-09-03 05:38:44

Split the 10^9 numbers, 10^7 to each computer ~ 80MB on each. Each computer sorts its numbers. Then computer 1 merge-sorts its own numbers with those from computer 2, computer 3 and 4, etc ... Then computer 1 writes half of the numbers back to 2, 3 to 4, etc. Then 1 merge sorts the numbers from computers 1,2,3,4, writes them back. And so on. Depending on the size of RAM on the computers you may get away with not writing all the numbers back to the individual computers at each step, you might be able to accumulate the numbers on computer 1 for several steps, but you do the maths.

Oh, finally get the mean of the 500000000th and 500000001st values (but check there are enough 00s in there, I haven't).

EDIT: @Roman -- well if you can't believe it even it it's true then there's no point in my revealing the truth or falsehood of the proposition. What I meant to state was that brute force sometimes beats smart in a race. It took me about 15 seconds to devise an algorithm which I am confident that I can implement, which will work, and which will be adaptable to a wide range of sizes of inputs and numbers of computers, and tunable to the characteristics of the computers and networking arrangements. If it takes you, or anyone else, say 15 minutes to devise a more sophisticated algorithm I have a 14m45s advantage to code up my solution and start it running.

But I freely admit this is all assertion, I haven't measured anything.

影子的影子 2024-09-03 05:38:44

This could be done on nodes using data that is not sorted across nodes (say from log files) in the following manner.

There is 1 parent node and 99 child nodes. The child nodes have two api calls:

  • stats(): returns min, max and count
  • compare(median_guess): returns count matching value, count less than value and count greater than value

The parent node calls stats() on all child nodes, noting the minimum and maximum of all nodes.

A binary search may now be conducted in the following way:

  1. Bisect the minimum and maximum rounding down - this is the median 'guess'
  2. If the greater than count is more than the less than count, set the minimum to the guess
  3. If the greater than count is less than the less than count, set the maximum to the guess
  4. If count is odd finish when minimum and maximum are equal
  5. If count is even finish when maximum <= minimum + guess.match_count

If stats() and compare() are backed by a pre-calculation - either an O((N/M) log(N/M)) sort per node, or an O(N/M) pass with O(N) memory - then compare() can answer in constant time, so the whole thing (including the pre-calculation) would run in O((N/M) log(N/M)) + O(log N).

Let me know if I have made a mistake!
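
A compact sketch of the parent's loop, assuming integer values and the stats()/compare() API described above; instead of the less-versus-greater comparison it uses the equivalent rank test "how many values are at most the guess", which sidesteps the match_count bookkeeping and returns the lower median:

class MedianBisect {
    interface Child {
        long[] stats();               // {min, max, count}
        long[] compare(long guess);   // {equal, less, greater} counts
    }

    static long lowerMedian(Child[] children) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, n = 0;
        for (Child c : children) {
            long[] s = c.stats();
            min = Math.min(min, s[0]);
            max = Math.max(max, s[1]);
            n += s[2];
        }
        long target = (n + 1) / 2;                 // rank of the lower median
        while (min < max) {
            long guess = min + (max - min) / 2;    // bisect, rounding down
            long atMost = 0;
            for (Child c : children) {             // one round of calls per step
                long[] r = c.compare(guess);
                atMost += r[0] + r[1];             // equal + less
            }
            if (atMost >= target) max = guess;     // median is <= guess
            else min = guess + 1;                  // median is > guess
        }
        return min;
    }
}

Each step is one broadcast and 99 replies, and the loop runs O(log(max - min)) times over the integer range, in line with the complexity estimate above.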

我不会写诗 2024-09-03 05:38:44

How about this: each node can take 1 billion/100 numbers. At each node the elements can be sorted and the median found. Find the median of those medians. By aggregating the counts of numbers less than the median-of-medians on all nodes, we can find the x%:y% split which the median-of-medians makes. Now ask all nodes to delete elements less than the median of medians (taking a 30%:70% split as an example): 30% of the numbers are deleted, and 70% of 1 billion is 700 million. Now all nodes which deleted fewer than 3 million numbers can send those extras back to a main computer. The main computer redistributes so that all nodes again hold an almost equal count (7 million). The problem is now reduced to 700 million numbers... and so on, until we have a set small enough to be computed on one computer.

滿滿的愛 2024-09-03 05:38:44

Let's first work out how to find a median of n numbers on a single machine:
I am basically using partitioning strategy.

Problem: selection(n, n/2): find the (n/2)-th number counting from the smallest.

You pick say middle element k and partition data into 2 sub arrays. the 1st contains all elements < k and 2nd contains all elements >= k.

If sizeof(1st sub-array) >= n/2, you know that this sub-array contains the median. You can then throw away the 2nd sub-array and solve selection(sizeof 1st sub-array, n/2).

Otherwise, throw away the 1st sub-array and solve selection(2nd sub-array, n/2 - sizeof(1st sub-array)).

Do it recursively.

The expected time complexity is O(n).

Now if we have many machines, in each iteration we have to process an array to split, so we distribute the array across the machines. Each machine processes its chunk of the array and sends a summary back to the hub (controlling) machine: the size of the 1st sub-array and the size of the 2nd sub-array. The hub machine adds up the summaries, decides which sub-array (1st or 2nd) to process further and what the 2nd parameter of selection becomes, and sends that back to each machine. And so on.

Could this algorithm be implemented very neatly using map-reduce?

How does it look?
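
A single-machine sketch of the selection recursion described above (iterative, with a three-way partition so runs of values equal to the pivot are handled); in the distributed version only the two partition sizes travel to the hub, never the data:

import java.util.concurrent.ThreadLocalRandom;

class QuickSelect {

    // k-th smallest element of a, with k 0-based; expected O(n) time
    static int select(int[] a, int k) {
        int lo = 0, hi = a.length;
        while (true) {
            int pivot = a[ThreadLocalRandom.current().nextInt(lo, hi)];
            // partition into [lo..lt) < pivot, [lt..gt) == pivot, [gt..hi) > pivot
            int lt = lo, i = lo, gt = hi;
            while (i < gt) {
                if (a[i] < pivot)      swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, --gt);
                else                   i++;
            }
            if (k < lt)       hi = lt;   // answer lies in the 1st sub-array
            else if (k >= gt) lo = gt;   // answer lies in the 2nd sub-array
            else return pivot;           // k lands on the pivot run
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

For the median, call select(a, (a.length - 1) / 2); the expected O(n) bound comes from the surviving side shrinking geometrically on average.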

沉鱼一梦 2024-09-03 05:38:44

I think Steve Jessop's answer will be the fastest.

If the network data transfer size is the bottleneck, here is another approach.

Divide the numbers into 100 computers (10 MB each). 
Loop until we have one element in each list     
    Find the median in each of them with quickselect, which is O(N), and we are processing in parallel. The lists will be partitioned at the end wrt the median.
    Send the medians to a central computer and find the median of medians. Then send the median back to each computer. 
    For each computer, if the overall median that we just computed is smaller than its median, continue in the lower part of the list (it is already partitioned), and if larger in the upper part.
When we have one number in each list, send them to the central computer and find and return the median.
捂风挽笑 2024-09-03 05:38:44

I would do it like this:

In the beginning all 100 computers work to find the highest and the lowest number; each computer has its part of the database/file, which it queries;

when the highest and lowest numbers are found, one computer reads the data, and distributes each number, evenly, to the rest of the 99; the numbers are distributed by equal intervals; (one may take from -100 million to 0, another - from 0 to 100 million, etc);

While receiving numbers, each of the 99 computers already sorts them;

Then, it's easy to find the median... See how many numbers each computer has, add them all up (the count of how many numbers there are, not the numbers themselves), and divide by 2; then calculate which computer holds that number, and at which index;

:) voilà

P.S. Seems there's a lot of confusion here; the MEDIAN - is the NUMBER IN THE MIDDLE OF A SORTED LIST OF NUMBERS!

弄潮 2024-09-03 05:38:44

You can use the tournament tree method for finding the median.
We can create a tree with 1000 leaf nodes such that each leaf node is an array.
We then conduct n/2 tournaments between the different arrays. The value at the root after the n/2 tournaments is the result.

http://www.geeksforgeeks.org/tournament-tree-and-binary-heap/

心的憧憬 2024-09-03 05:38:44

If the numbers are not distinct, and only belong to a certain range, that is they are repeated, then a simple solution that comes to my mind is to distribute the numbers among 99 machines equally, and keep one machine as the master. Now every machine iterates over its given numbers, and stores the count of each number in a hash set. Each time the number gets repeated in the set of numbers allotted to that particular computer, it updates its count in the hash set.

All the machines then return their hash set to the master machine. The master machine combines the hash sets, summing the counts of the same key found across them. For example machine#1's hash set had an entry of ("1",7), and machine#2's hash set had an entry of ("1",9), so the master machine when combining the hash sets makes an entry of ("1", 16), and so on.

Once the hash sets have been merged, just sort the keys; now you can easily find the (n/2)-th item and the ((n+2)/2)-th item from the sorted hash set.

This method won't be beneficial if the billion numbers are distinct.
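
A sketch of the master's side (the types are illustrative): merge the per-machine frequency maps, then scan the keys in sorted order until the cumulative count reaches the target rank.

import java.util.Map;
import java.util.TreeMap;

class FrequencyMedian {

    // returns the lower median; for even n you would also note the next key
    static long lowerMedian(Iterable<Map<Long, Long>> perMachineCounts) {
        TreeMap<Long, Long> combined = new TreeMap<>();
        long n = 0;
        for (Map<Long, Long> counts : perMachineCounts) {
            for (Map.Entry<Long, Long> e : counts.entrySet()) {
                combined.merge(e.getKey(), e.getValue(), Long::sum);  // ("1",7)+("1",9) -> ("1",16)
                n += e.getValue();
            }
        }
        long target = (n + 1) / 2, seen = 0;
        for (Map.Entry<Long, Long> e : combined.entrySet()) {
            seen += e.getValue();
            if (seen >= target) return e.getKey();
        }
        throw new IllegalStateException("empty input");
    }
}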

失而复得 2024-09-03 05:38:44

Well, suppose you know that the number of distinct integers is (say) 4 billion, then you can bucket them into 64k buckets and get a distributed count for each bucket from each machine in the cluster (100 computers). Combine all these counts. Now find the bucket which holds the median, and this time only ask for counts of the 64k values that would lie in your target bucket. This requires O(1) (specifically 2) rounds of queries over your "cluster". :D

逆光飞翔i 2024-09-03 05:38:44

My penny's worth, after all that has already been brought up by others:

Finding the median on a single machine is O(N): https://en.wikipedia.org/wiki/Selection_algorithm.

Sending N numbers to 100 machines is also O(N). So, in order to make using 100 machines interesting, either the communication must be relatively fast, or N is so large that a single machine cannot handle it while N/100 is doable, or we just want to consider the mathematical problem without bothering about data communication.

To cut things short I'll assume therefore that, within reasonable limits, we can send/distribute the numbers without affecting the efficiency analysis.

Consider then the following approach, where one machine is assigned to be the "master" for some general processing. This will be comparatively fast, so the "master" also participates in the common tasks that each machine performs.

  1. Each machine receives N/100 of the numbers, computes its own median and sends that information to the master.
  2. The master compiles a sorted list of all distinct medians and sends that back to each machine, defining an ordered sequence of buckets (on each machine the same), one for each median value (a single-value bucket) and one for each interval between adjacent medians. Of course there are also the lower-end and higher-end buckets for values below the lowest median and above the highest.
  3. Each machine computes how many numbers fall in each bucket and communicates that information back to the master.
  4. The master determines which bucket contains the median, how many lower values (in total) fall below that bucket, and how many above.
  5. If the selected bucket is a single-value bucket (one of the medians), or else the selected bucket contains only 1 (N odd) or 2 (N even) values, we're done. Otherwise we repeat the steps above with the following (obvious) modifications:
  6. Only the numbers from the selected bucket are (re)distributed from the master to the 100 machines, and moreover
  7. We're not going to compute (on each machine) the median, but the k-th value, where we take into account how many higher numbers have been discarded from the total, and how many lower numbers. Conceptually each machine has also its share of the discarded low/high numbers and takes that into account when computing the new median in the set that (conceptually) includes (its share of) the discarded numbers.

Time-complexity:

  1. A little thinking will convince you that on each step the total number of values to analyse is reduced by a factor at least two (2 would be a rather sick case; you may expect a significantly better reduction). From this we get:
  2. Assuming that finding the median (or k-th value), which is O(N), takes c*N time where the prefactor c does not vary too wildly with N so that we can take it as a constant for the moment, we'll get our final result in at most 2*c*N/100 time. Using 100 machines gives us, therefore, a speedup factor of 100/2 (at least).
  3. As remarked initially: the time involved in communicating the numbers between the machines may make it more attractive to simply do everything on one machine. However, IF we go for the distributed approach, the total count of numbers to be communicated in all steps together will not exceed 2*N (N for the first time, <=N/2 the second time, <= half of that the third, and so on).
素年丶 2024-09-03 05:38:44

  1. Divide the 1 billion numbers into 100 machines. Each machine will have 10^7 numbers.

  2. For each incoming number to a machine, store the number in a frequency map,
    number -> count. Also store the min number in each machine.

  3. Find the median in each machine: starting from the min number in each machine, sum the counts until the median index is reached. The median in each machine will have approximately 5*10^6 numbers less than it and 5*10^6 greater.

  4. Find the median of all the medians, which will have approximately 50*10^7 numbers less than and greater than it - this is the median of the 1 billion numbers.

Now some optimization of 2nd step: Instead of storing in a frequency map, store the counts in a variable bit array. For example: Lets say starting from min number in a machine, these are frequency counts:

[min number] - 8 count
[min+1 number] - 7 count
[min+2 number] - 5 count

The above can be stored in bit array as:

[min number] - 10000000
[min+1 number] - 1000000
[min+2 number] - 10000

Note that altogether it will cost about 10^7 bits for each machine, since each machine only handles 10^7 numbers. 10^7bits = 1.25*10^6 bytes, which is 1.25MB

So with the above approach each machine will need 1.25MB of space to compute local median. And median of medians can be computed from those 100 local medians, resulting in median of 1 billion numbers.

烟柳画桥 2024-09-03 05:38:44

I suggest a method to calculate the median approximately. :) If these one billion numbers are in random order, I think I can pick 1/100 or 1/10 of the one billion numbers randomly, sort them with the 100 machines, then pick their median. Or let's split the billion numbers into 100 parts, let each machine pick 1/10 of each part randomly, and calculate the median of those. After that we have 100 numbers, and we can calculate the median of those 100 numbers more easily. Just a suggestion - I'm not sure if it's mathematically correct. But I think you can show the result to a not-so-good-at-math manager.
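
A sketch of the sampling idea (the sample size and sampling with replacement are choices of mine): the answer's caveat stands - this only estimates the median, with accuracy improving as the sample grows.

import java.util.Arrays;
import java.util.Random;

class SampledMedian {

    static long approxMedian(long[] data, int sampleSize, Random rnd) {
        long[] sample = new long[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = data[rnd.nextInt(data.length)];   // sample with replacement
        }
        Arrays.sort(sample);                              // small, so cheap to sort
        return sample[sampleSize / 2];                    // median of the sample
    }
}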

楠木可依 2024-09-03 05:38:44

Steve Jessop's answer is wrong:

consider the following four groups:

{2, 4, 6, 8, 10}

{21, 21, 24, 26, 28}

{12, 14, 30, 32, 34}

{16, 18, 36, 38, 40}

The median is 21, which is contained in the second group.

The medians of the four groups are 6, 24, 30, 36; the total median is 27.

So after the first loop, the four groups will become:

{6, 8, 10}

{24, 26, 28}

{12, 14, 30}

{16, 18, 36}

The 21 is already wrongly discarded.

This algorithm only supports the case where there are two groups.
