Update
Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):
Obvious
- Count
- Sum
- Mean
- Max*
- Min*
- Median**
Less Obvious
- Variance
- Standard Deviation
- Skewness
- Kurtosis
- Mode***
- Weighted Average
- Weighted Moving Average****
OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.
*Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.
**Recalculated in O(1) for a sorted, indexed collection only.
***Requires a fairly complex data structure to recalculate in O(1).
****This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
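The linearly descending case in the last footnote can be sketched as follows. This is a minimal Python sketch, not a definitive implementation. It assumes a fixed-size window in which the newest value gets the largest weight and a "removal" always evicts the oldest value. The class name `LinearWMA` and its methods are hypothetical.

```python
from collections import deque


class LinearWMA:
    """Weighted moving average with weights 1 (oldest) .. k (newest)."""

    def __init__(self, size):
        self.size = size          # maximum window length (assumed fixed)
        self.window = deque()     # values, oldest on the left
        self.total = 0.0          # plain sum of the window
        self.numerator = 0.0      # sum of weight * value

    def add(self, x):
        if len(self.window) == self.size:
            # Every existing weight drops by one (subtract the old total),
            # which pushes the oldest value's weight to zero, and the new
            # value enters with the largest weight.
            oldest = self.window.popleft()
            self.numerator += self.size * x - self.total
            self.total += x - oldest
        else:
            # Window still filling: existing weights stay, new value gets k + 1.
            self.numerator += (len(self.window) + 1) * x
            self.total += x
        self.window.append(x)

    def value(self):
        k = len(self.window)
        if k == 0:
            raise ValueError("empty window")
        return self.numerator / (k * (k + 1) / 2)


w = LinearWMA(3)
for x in [1, 2, 3, 4]:
    w.add(x)
print(w.value())  # window is 2, 3, 4 with weights 1, 2, 3
```

Each `add` touches only two running totals, so the cost does not depend on the window length.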
Original Question
Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...
Option 1: Iterate through the collection, adding all the values:
double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];
Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:
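void Add(double value) {
    values.Add(value);
    sum += value;
}
void Remove(double value) {
    // Only adjust the sum if the value was actually removed
    // (Remove returns false when the value is absent, as List<T>.Remove does).
    if (values.Remove(value)) sum -= value;
}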
EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:
Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said "11" you thought "11", when I said "16" you thought "27", and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," and you've been keeping track of the sum in your head, you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.
So my question is, what other calculations, other than the obvious ones like sum and average, are like this?
SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
So I think this gets to the heart of my question: if we can divide statistics into two categories, those that are maintainable (my own term; maybe there's a more official one somewhere) and those that require iterating over the collection every time it changes, what are all the maintainable ones?
What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data. They just don't need to iterate through it again whenever it changes.
Comments (8)
First, the term that you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that median and mode cannot be calculated online.
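One way to keep the moments current under both additions and removals is to store the raw power sums and derive the central moments on demand. The Python sketch below is only an illustration under that assumption. The class `RunningMoments` is hypothetical, it reports population (not sample) statistics, and raw power sums can lose precision when the values are large relative to their spread.

```python
class RunningMoments:
    """Keeps n and the raw power sums sum(x), sum(x^2), sum(x^3), sum(x^4)."""

    def __init__(self):
        self.n = 0
        self.s = [0.0, 0.0, 0.0, 0.0]

    def add(self, x):
        self.n += 1
        for k in range(4):
            self.s[k] += x ** (k + 1)

    def remove(self, x):
        # Assumes x is currently in the collection; the sums cannot check this.
        self.n -= 1
        for k in range(4):
            self.s[k] -= x ** (k + 1)

    def mean(self):
        return self.s[0] / self.n

    def _central(self):
        # Central moments expressed through the raw moments r1..r4.
        m = self.mean()
        _, r2, r3, r4 = (v / self.n for v in self.s)
        m2 = r2 - m * m
        m3 = r3 - 3 * m * r2 + 2 * m ** 3
        m4 = r4 - 4 * m * r3 + 6 * m * m * r2 - 3 * m ** 4
        return m2, m3, m4

    def variance(self):   # population variance
        return self._central()[0]

    def skewness(self):
        m2, m3, _ = self._central()
        return m3 / m2 ** 1.5

    def kurtosis(self):   # non-excess kurtosis
        m2, _, m4 = self._central()
        return m4 / (m2 * m2)


rm = RunningMoments()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    rm.add(x)
print(rm.mean(), rm.variance())
rm.remove(9)              # constant-time update, no pass over the data
print(rm.mean(), rm.variance())
```

Both `add` and `remove` do a fixed amount of arithmetic, so every statistic derived from the sums stays O(1) per change.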
To consistently maintain the high/low, store your data in sorted order. There are algorithms for maintaining data structures that preserve ordering.
Median is trivial if the data is ordered.
If the data is reduced slightly to a frequency table, you can maintain mode. If you keep your data as a random, flat list of values, you can't easily compute mode in the presence of change.
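The frequency-table idea can be sketched in Python as below. This is an assumption about one workable layout, not the answerer's code: a table of counts per value, plus a reverse index from each count to the values that have it, so the mode survives removals without a rescan. The name `ModeTracker` is hypothetical, and ties are broken arbitrarily.

```python
from collections import defaultdict


class ModeTracker:
    def __init__(self):
        self.counts = defaultdict(int)    # value -> how often it occurs
        self.buckets = defaultdict(set)   # frequency -> values with that frequency
        self.max_freq = 0

    def add(self, x):
        c = self.counts[x]
        if c:
            self.buckets[c].discard(x)
        self.counts[x] = c + 1
        self.buckets[c + 1].add(x)
        if c + 1 > self.max_freq:
            self.max_freq = c + 1

    def remove(self, x):
        c = self.counts.get(x, 0)
        if c == 0:
            raise KeyError(x)
        self.buckets[c].discard(x)
        if c == 1:
            del self.counts[x]
        else:
            self.counts[x] = c - 1
            self.buckets[c - 1].add(x)
        # If x was the only value at the top frequency, the top drops by
        # exactly one, because x itself now sits one bucket lower.
        if c == self.max_freq and not self.buckets[c]:
            self.max_freq = c - 1

    def mode(self):
        if self.max_freq == 0:
            raise ValueError("empty collection")
        return next(iter(self.buckets[self.max_freq]))


t = ModeTracker()
for x in [1, 2, 2, 3, 3, 3]:
    t.add(x)
print(t.mode())   # 3
t.remove(3)
t.remove(3)
print(t.mode())   # 2
```

All operations are hash-table lookups, so they run in expected constant time.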
The answers to this question on online algorithms might be useful. Regarding the usability for your needs, I'd say that while some online algorithms can be used for estimating summary statistics with partial data, others may be used to maintain them from a data flow just as you like.
You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real time data, for example in finance or web commerce. The only free CEP product I know of is Esper.
As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.
Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?
For the high value, for example:
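A minimal Python sketch of one such rule for the high value, offered as an assumption about what is meant rather than the answerer's own code: additions update the cached maximum directly, and only removing the last copy of the current maximum forces a rescan. The class `CachedMax` and its `rescans` counter are hypothetical.

```python
from collections import Counter


class CachedMax:
    def __init__(self):
        self.counts = Counter()   # multiset of the stored values
        self.high = None          # cached maximum
        self.rescans = 0          # how often a full pass was needed

    def add(self, x):
        self.counts[x] += 1
        if self.high is None or x > self.high:
            self.high = x         # O(1): compare against the cached value

    def remove(self, x):
        if self.counts[x] == 0:
            raise KeyError(x)
        self.counts[x] -= 1
        if self.counts[x] == 0:
            del self.counts[x]
            if x == self.high:
                # Only this case needs iteration over the remaining values.
                self.high = max(self.counts) if self.counts else None
                self.rescans += 1


c = CachedMax()
for x in [5, 9, 3, 9]:
    c.add(x)
c.remove(3)               # cache still valid
c.remove(9)               # another 9 remains, cache still valid
c.remove(9)               # last copy of the maximum: rescan
print(c.high, c.rescans)  # 5 1
```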
It's not possible to maintain high or low with constant-time add and remove operations, because that would give you a linear-time comparison sort (add every element, then repeatedly read and remove the minimum), which contradicts the Ω(n log n) lower bound for comparison sorting. You can use a search tree to maintain the data in sorted order, which gives you logarithmic-time minimum and maximum. If you also keep subtree sizes and the count, it's simple to find the median too.
And if you just want to maintain the high or low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.
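As a rough Python illustration of the priority-queue route, the sketch below keeps a binary heap for the low value and handles arbitrary removals lazily: removed values are only recorded, and stale heap entries are discarded when they reach the top. The lazy-deletion scheme and the name `MinTracker` are assumptions made for this example. Operations are logarithmic (amortized), not constant.

```python
import heapq
from collections import Counter


class MinTracker:
    def __init__(self):
        self.heap = []             # min-heap, may contain stale entries
        self.removed = Counter()   # values removed but still in the heap

    def add(self, x):
        heapq.heappush(self.heap, x)

    def remove(self, x):
        # Assumes x is currently in the collection; this is not checked.
        self.removed[x] += 1

    def minimum(self):
        # Drop stale entries until the top of the heap is a live value.
        while self.heap and self.removed[self.heap[0]] > 0:
            top = heapq.heappop(self.heap)
            self.removed[top] -= 1
        if not self.heap:
            raise ValueError("empty collection")
        return self.heap[0]


t = MinTracker()
for x in [5, 15, 19, 20, 21, 25, 30]:
    t.add(x)
print(t.minimum())   # 5
t.remove(5)
print(t.minimum())   # 15
```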
If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or you just want some ideas, you should definitely look into techniques used in Streaming Algorithms.
It does sound (even after your 2nd edit) like you are describing on-line algorithms, with the additional requirement that you want to allow "delete" operations. One example is the family of "sketch algorithms" used for finding frequent items in a stream.
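To give a flavor of a sketch that tolerates deletions, here is a small count-min sketch in Python. It is an illustrative assumption, not necessarily the algorithm the answer has in mind. The class name, the default sizes, and the use of Python's built-in `hash` salted with the row index (a stand-in for independent hash functions) are all choices made for this example.

```python
class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One counter per row, chosen by a row-salted hash of the item.
        return [(row, hash((row, item)) % self.width) for row in range(self.depth)]

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def remove(self, item, count=1):
        # Assumes the item was added at least `count` times before.
        self.add(item, -count)

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the
        # rows never underestimates the true frequency.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for x in [5, 15, 15, 30, 30, 30]:
    cms.add(x)
cms.remove(30)
print(cms.estimate(30))   # at least 2
```

Memory stays fixed at `width * depth` counters regardless of how many items pass through, at the price of approximate answers.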