Incremental median computation with maximum memory efficiency
I have a process that generates values, which I observe. When the process terminates, I want to compute the median of those values.
If I had to compute the mean, I could just store the running sum and the number of generated values, and thus have an O(1) memory requirement. How about the median? Is there a way to avoid the obvious O(n) memory cost of storing all the values?
Edit: I'm interested in two cases: 1) the stream length is known, 2) it's not.
4 Answers
You are going to need to store at least ceil(n/2) points, because any one of the first n/2 points could be the median. It is probably simplest to just store all the points and find the median at the end. If getting down to ceil(n/2) stored points is worthwhile, read the first n/2 points into a sorted structure (a binary tree is probably best); then, as new points are added, throw out the lowest or highest point and keep track of how many points have been thrown out at each end.
Edit:
If the stream length is unknown then, as Stephen observed in the comments, we have no choice but to remember everything. If duplicate items are likely, we could save a bit of memory using Dolphin's idea of storing values and counts.
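For the known-length case, here is a minimal sketch of the same O(ceil(n/2)) memory bound. In place of the sorted list with two-sided discards described above, it keeps the r = ceil(n/2) smallest values seen so far in a max-heap, so the median is the largest of them once the stream ends; the function name and signature are illustrative, not from the answer.

```python
import heapq

def streaming_median(stream, n):
    """Median of a stream whose length n is known up front.

    Keeps only the r = ceil(n/2) smallest values seen so far in a
    max-heap (heapq is a min-heap, hence the negated values), so
    memory is O(ceil(n/2)) rather than O(n). For even n this
    returns the lower median.
    """
    r = (n + 1) // 2              # 1-indexed rank of the (lower) median
    heap = []                     # the r smallest values so far, negated
    for x in stream:
        if len(heap) < r:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:        # x displaces the current r-th smallest
            heapq.heapreplace(heap, -x)
    return -heap[0]               # the r-th smallest of all n values

print(streaming_median([1, 3, 5, 7, 9, 2, 6, 4, 8], 9))  # prints 5
```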
I had the same problem and found an approach that has not been posted here yet. Hopefully my answer can help someone in the future.
If you know your value range and don't care much about median precision, you can incrementally build a histogram of quantized values using constant memory. It is then easy to find the median, or the value at any rank, to within your quantization error.
For example, suppose your data stream is image pixel values and you know these values are integers all falling within 0~255. To build the histogram incrementally, just create 256 counters (bins) initialized to zero, and increment the bin corresponding to each pixel value while scanning through the input. Once the histogram is built, the median is the first value whose cumulative count exceeds half of the data size.
For data that are real numbers, you can still compute a histogram with each bin covering a quantized range (e.g. bins of width 10, 1, or 0.1, etc.), depending on your expected value range and the precision you want.
If you don't know the value range of the entire data sample, you can still estimate the range the median could fall in and compute the histogram within that range. This drops outliers by nature, but that is exactly what we want when computing a median.
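A minimal sketch of the 8-bit pixel case, assuming integer values in 0~255; the function name and the tie handling (returning the first value whose cumulative count passes half) are my choices, not part of the answer:

```python
def histogram_median(stream, n_bins=256):
    """Streaming median for integers in [0, n_bins) using O(n_bins) memory."""
    counts = [0] * n_bins
    total = 0
    for v in stream:
        counts[v] += 1                   # one increment per value; nothing else stored
        total += 1
    cumulative = 0
    for value, c in enumerate(counts):
        cumulative += c
        if cumulative > total / 2:       # first bin past the halfway point
            return value

print(histogram_median([3, 0, 255, 3, 7]))  # prints 3
```

For real-valued data the same loop applies once each value is mapped to a bin index, e.g. `int((v - lo) / bin_width)` for an assumed range starting at `lo`.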
If that is acceptable, you can store the values together with their counts: k distinct values means storing O(k) memory. In the worst case this is still O(n), just with a smaller constant.
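A short sketch of that values-and-counts idea, assuming a discrete stream; `counted_median` is an illustrative name. Sorting the k distinct values at the end costs O(k log k) time, but the memory stays O(k):

```python
from collections import Counter

def counted_median(stream):
    """Store (value, count) pairs: O(k) memory for k distinct values."""
    counts = Counter(stream)             # one counter per distinct value
    total = sum(counts.values())
    cumulative = 0
    for value in sorted(counts):         # walk values in increasing order
        cumulative += counts[value]
        if cumulative > total / 2:       # first value past the halfway point
            return value

print(counted_median([1, 1, 1, 2, 5]))  # prints 1
```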
If you have discrete values and lots of repetition, you could store the values and counts, which would save a bit of space.
Possibly, at stages through the computation, you could discard the top 'n' and bottom 'n' values, as long as you are sure that the median is not in that top or bottom range.
e.g. Let's say you are expecting 100,000 values. Every time your stored count reaches (say) 12,000, you could discard the highest 1,000 and the lowest 1,000, dropping storage back to 10,000.
If the distribution of values is fairly consistent, this works well. However, if there is a possibility that you will receive a large number of very high or very low values near the end, that might distort your computation. Basically, if you discard a "high" value that is less than the (eventual) median, or a "low" value that is equal to or greater than the (eventual) median, then your calculation is off.
Update
Bit of an example
Let's say that the data set is the numbers 1,2,3,4,5,6,7,8,9.
By inspection the median is 5.
Let's say that the first 5 numbers you get are 1,3,5,7,9.
To save space we discard the highest and lowest, leaving 3,5,7
Now get two more, 2,6 so our storage is 2,3,5,6,7
Discard the highest and lowest, leaving 3,5,6
Get the last two 4,8 and we have 3,4,5,6,8
Median is still 5 and the world is a good place.
However, let's say that the first five numbers we get are 1,2,3,4,5
Discard top and bottom leaving 2,3,4
Get two more 6,7 and we have 2,3,4,6,7
Discard top and bottom leaving 3,4,6
Get last two 8,9 and we have 3,4,6,8,9
With a median of 6, which is incorrect.
If our numbers are well distributed, we can keep trimming the extremities. If they might be bunched in lots of large or lots of small numbers, then discarding is risky.
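Here is a sketch of that trimming heuristic, with an illustrative capacity of 5 and one value trimmed from each end whenever the buffer fills. The two runs replay the worked example above: the shuffled order succeeds, the sorted order fails.

```python
import bisect

def trimmed_median(stream, capacity=5):
    """Keep a sorted buffer; whenever it fills to `capacity`, discard one
    value from each end and count the discards. This only returns the true
    median if no discarded "high" value falls below it and no discarded
    "low" value falls at or above it.
    """
    buf, dropped_low, dropped_high = [], 0, 0
    for x in stream:
        bisect.insort(buf, x)
        if len(buf) == capacity:
            buf.pop(0)                   # discard current minimum
            dropped_low += 1
            buf.pop()                    # discard current maximum
            dropped_high += 1
    n = dropped_low + len(buf) + dropped_high
    k = (n - 1) // 2                     # 0-indexed rank of the (lower) median
    return buf[k - dropped_low]

print(trimmed_median([1, 3, 5, 7, 9, 2, 6, 4, 8]))  # 5 -- correct
print(trimmed_median([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 6 -- wrong; true median is 5
```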