How can I compute the exact median of a sorted array in constant space, without storing the whole array?

Posted 2024-12-09 13:24:11 · 116 words · 0 views · 0 comments

I need to read a sorted array from input in awk/gawk and compute its median. I don't want to store the whole array, and I'm trying to do the calculation in constant space.

Are you aware of an algorithm that does this? The array is sorted, but its size is not known in advance.

Thank you in advance!

Comments (3)

叶落知秋 2024-12-16 13:24:11

There is no algorithm to exactly find the median of a sorted sequence of unknown length that runs with a fixed amount of memory.

To see this, consider such an algorithm. Say it has a buffer of length N for holding items from the sequence. Until this buffer is full, the algorithm simply puts items in it, tracking the median while it does so.

When the algorithm scans the (N+1)th item and beyond, it must choose one item to drop at each step. Suppose it has already scanned 2N items and dropped half of them. Let's give it the benefit of the doubt and say it has not yet dropped the median of the input stream.

Consider when it scans the (2N+1)th item. Which item should it drop? It can't drop the least element it has kept so far, since the input may be exhausted after this item, in which case that lowest element may be the median. Likewise, for any element it might drop, there is a possible continuation of the input sequence that makes the dropped element the median.
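To make the argument concrete, here is a small sketch in Python (not awk, purely for illustration). It takes a sorted prefix and shows how the median moves as more large values arrive. The function name `reachable_medians` and the filler value are made up for this example, and it uses the lower median for even lengths to keep the output simple.

```python
from statistics import median_low


def reachable_medians(prefix, max_extra):
    """Return the lower median after appending k large values, for k = 0..max_extra.

    `prefix` must already be sorted. The filler value is hypothetical: any
    value at least as large as the last item keeps the sequence sorted.
    """
    big = prefix[-1] + 1
    return [median_low(prefix + [big] * k) for k in range(max_extra + 1)]


prefix = [10, 20, 30, 40, 50]
print(reachable_medians(prefix, 4))  # [30, 30, 40, 40, 50]
```

Every item from the current middle to the end of the prefix becomes the median for some continuation, so none of them can be safely discarded while the input is still open.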

If you are willing to take approximate results, then this estimator may work for you.

家住魔仙堡 2024-12-16 13:24:11

Take two passes: use the first just to work out the size of the array, storing the data in a file if necessary. Otherwise, you can't do it without storing the array. If you take the state of the program after reading n items, then by feeding it enough very large numbers you can make any of the last n/2 items come out as the median, so the program must in fact be remembering at least those items.
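The two-pass idea could look roughly like the following Python sketch. It assumes the data sits in a file with one number per line, already sorted; the function name `two_pass_median` and the file layout are assumptions for illustration, not something given in the question.

```python
def two_pass_median(path):
    """Median of a sorted file (one number per line) using O(1) extra memory."""
    # Pass 1: count the values without keeping any of them.
    with open(path) as f:
        n = sum(1 for line in f if line.strip())
    if n == 0:
        raise ValueError("no values in input")

    # 0-based positions of the middle element(s); equal when n is odd.
    lo, hi = (n - 1) // 2, n // 2

    # Pass 2: walk to the middle and stop there.
    with open(path) as f:
        values = (float(line) for line in f if line.strip())
        for i, v in enumerate(values):
            if i == lo:
                low = v
            if i == hi:
                return (low + v) / 2
```

In awk the same idea is commonly written by passing the file name twice on the command line and using `NR == FNR` to tell the first pass from the second. Note that this only works when the input can be re-read, so a pipe would first have to be saved to a file.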

一身骄傲 2024-12-16 13:24:11

Basically what you're asking for is an "algorithm" to find the size N of the array, because the median will be element number (N+1)/2 (ignoring even/odd details for now).

I can't think of an algorithm that doesn't involve two passes. By definition, you need a first pass to figure out N.

While scanning element i+1, you could keep a buffer of the previous i/2 elements. When you reach the end of the array, the median would just be the first value in the buffer, i.e., this requires only one pass. The problem is that you would have to allocate enough memory for the buffer to hold N/2 elements, but you don't know what N is, so you don't know how large the buffer should be! Also, if N values are too many to store, as you state in the question, then presumably N/2 values are also too many to store (otherwise my advice would be: just double your RAM).
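For reference, the one-pass buffer described above might be sketched like this in Python. The name `one_pass_median` is hypothetical, and a growable `deque` stands in for the buffer, which sidesteps the allocation question but not the memory cost.

```python
from collections import deque


def one_pass_median(values):
    """Median of a sorted iterable in one pass, keeping roughly the last half."""
    buf = deque()  # holds the elements from the current lower-middle index onward
    start = 0      # index (in the input) of buf[0]
    n = 0          # number of values read so far
    for v in values:
        buf.append(v)
        n += 1
        lo = (n - 1) // 2
        # Drop elements that can no longer be a median of the input read so far.
        while start < lo:
            buf.popleft()
            start += 1
    if n == 0:
        raise ValueError("no values in input")
    hi = n // 2
    # buf[0] is element lo; buf[hi - lo] is element hi (the same one when n is odd).
    return (buf[0] + buf[hi - lo]) / 2


print(one_pass_median([1, 3, 5, 7]))  # 4.0
```

The buffer ends up holding about N/2 + 1 values, so memory still grows linearly with the input rather than staying constant.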

So this buffer approach isn't an option. Two passes it is. One to figure out N, one to get element (N+1)/2.
