What statistics can be maintained for a set of numerical data without iterating?

Posted 2024-08-08 18:52:26

Update

Just for future reference, I'm going to list all of the statistics that I'm aware of that can be maintained in a rolling collection, recalculated as an O(1) operation on every addition/removal (this is really how I should've worded the question from the beginning):

Obvious

Count
Sum
Mean
Max*
Min*
Median**

Less Obvious

Variance
Standard Deviation
Skewness
Kurtosis
Mode***
Weighted Average
Weighted Moving Average****

OK, so to put it more accurately: these are not "all" of the statistics I'm aware of. They're just the ones that I can remember off the top of my head right now.

* Can be recalculated in O(1) for additions only, or for additions and removals if the collection is sorted (but in this case, insertion is not O(1)). Removals potentially incur an O(n) recalculation for non-sorted collections.

** Recalculated in O(1) for a sorted, indexed collection only.

*** Requires a fairly complex data structure to recalculate in O(1).

**** This can certainly be achieved in O(1) for additions and removals when the weights are assigned in a linearly descending fashion. In other scenarios, I'm not sure.
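As a rough illustration of the last footnote, here is a minimal sketch of a weighted moving average over a fixed window with linearly descending weights (the newest value gets weight `window`, the one before it `window - 1`, and so on). The class name `LinearWeightedMovingAverage` and its members are hypothetical and not from the original post. The trick is that adding a value lowers every existing weight by one, which is the same as subtracting the plain sum of the window once.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public sealed class LinearWeightedMovingAverage
{
    private readonly int window;
    private readonly Queue<double> values = new Queue<double>();
    private double total;      // plain sum of the values currently in the window
    private double numerator;  // sum of weight * value over the window

    public LinearWeightedMovingAverage(int window)
    {
        if (window <= 0) throw new ArgumentOutOfRangeException(nameof(window));
        this.window = window;
    }

    // O(1): every existing weight drops by one, which subtracts the old total once,
    // and the new value enters with the highest weight.
    public void Add(double value)
    {
        numerator += window * value - total;
        total += value;
        values.Enqueue(value);
        if (values.Count > window)
            total -= values.Dequeue(); // the oldest value now has weight 0
    }

    public double Value
    {
        get
        {
            int k = values.Count;
            if (k == 0) throw new InvalidOperationException("No values yet.");
            // Weights in use are window, window - 1, ..., window - k + 1.
            double weightSum = k * (2.0 * window - k + 1) / 2.0;
            return numerator / weightSum;
        }
    }
}
```

With a window of 3 and the inputs 1, 2, 3, the numerator is 3*3 + 2*2 + 1*1 = 14 and the average is 14/6; after adding 4 it becomes 20/6. Running floating-point sums like these can drift over very long streams, so an occasional full recomputation may be worthwhile.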

Original Question

Say I maintain a collection of numerical data -- let's say, just a bunch of numbers. For this data, there are loads of calculated values that might be of interest; one example would be the sum. To get the sum of all this data, I could...

Option 1: Iterate through the collection, adding all the values:

double sum = 0.0;
for (int i = 0; i < values.Count; i++) sum += values[i];

Option 2: Maintain the sum, eliminating the need to ever iterate over the collection just to find the sum:

void Add(double value) {
    values.Add(value);
    sum += value;
}

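void Remove(double value) {
    // Remove returns false when the value was not in the collection;
    // only adjust the sum when something was actually removed.
    if (values.Remove(value))
        sum -= value;
}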

EDIT: To put this question in more relatable terms, let's compare the two options above to a (sort of) real-world situation:

Suppose I start listing numbers out loud and ask you to keep them in your head. I start by saying, "11, 16, 13, 12." If you've just been remembering the numbers themselves and nothing more, and then I say, "What's the sum?", you'd have to think to yourself, "OK, what's 11 + 16 + 13 + 12?" before responding, "52." If, on the other hand, you had been keeping track of the sum yourself while I was listing the numbers (i.e., when I said "11", you thought "11"; when I said "16", you thought "27"; and so on), you could answer "52" right away. Then if I say, "OK, now forget the number 16," if you've been keeping track of the sum inside your head you can simply take 16 away from 52 and know that the new sum is 36, rather than taking 16 off the list and then summing up 11 + 13 + 12.

So my question is, what other calculations, other than the obvious ones like sum and average, are like this?

SECOND EDIT: As an arbitrary example of a statistic that (I'm almost certain) does require iteration -- and therefore cannot be maintained as simply as a sum or average -- consider if I asked you, "how many numbers in this collection are divisible by the min?" Let's say the numbers are 5, 15, 19, 20, 21, 25, and 30. The min of this set is 5, which divides into 5, 15, 20, 25, and 30 (but not 19 or 21), so the answer is 5. Now if I remove 5 from the collection and ask the same question, the answer is now 2, since only 15 and 30 are divisible by the new min of 15; but, as far as I can tell, you cannot know this without going through the collection again.
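The example above can be checked with a short sketch. The class and method names are hypothetical, and the code assumes a non-empty list of positive integers. Note that both the search for the minimum and the divisibility count walk the whole list, which is exactly the cost the question is trying to avoid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not from the original post.
public static class DivisibleByMinExample
{
    // Counts how many values are divisible by the smallest value.
    // Assumes a non-empty list of positive integers.
    public static int CountDivisibleByMin(IReadOnlyList<int> values)
    {
        int min = values.Min();                  // first full pass
        return values.Count(v => v % min == 0);  // second full pass
    }
}
```

For 5, 15, 19, 20, 21, 25, 30 this returns 5; once 5 is removed, the minimum becomes 15 and the result is 2.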

So I think this gets to the heart of my question: if we can divide kinds of statistics into these categories, those that are maintainable (my own term, maybe there's a more official one somewhere) versus those that require iteration to compute any time a collection is changed, what are all the maintainable ones?

What I am asking about is not strictly the same as an online algorithm (though I sincerely thank those of you who introduced me to that concept). An online algorithm can begin its work without having even seen all of the input data; the maintainable statistics I am seeking will certainly have seen all the data, they just don't need to reiterate through it over and over again whenever it changes.

妥活 2024-08-15 18:52:26

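First, the term you want here is online algorithm. All moments (mean, standard deviation, skew, etc.) can be calculated online. Others include the minimum and maximum. Note that the median and the mode cannot be calculated online.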
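One way to keep the first two moments current under both additions and removals is to track the count, the sum, and the sum of squares. The sketch below uses hypothetical names (`RunningMoments` and its members) and computes the population variance. It assumes callers only remove values that were previously added. Higher moments such as skewness could be handled in the same spirit by also tracking sums of cubes and fourth powers.

```csharp
using System;

// Hypothetical helper, not from the original post.
public sealed class RunningMoments
{
    private long count;
    private double sum;
    private double sumOfSquares;

    // O(1) update on addition.
    public void Add(double value)
    {
        count++;
        sum += value;
        sumOfSquares += value * value;
    }

    // O(1) update on removal; the caller must only remove values that were added.
    public void Remove(double value)
    {
        count--;
        sum -= value;
        sumOfSquares -= value * value;
    }

    public long Count => count;

    public double Mean => sum / count;

    // Population variance: E[x^2] - (E[x])^2.
    public double Variance
    {
        get
        {
            double mean = Mean;
            return sumOfSquares / count - mean * mean;
        }
    }

    // Clamp at zero because rounding can push a tiny variance slightly negative.
    public double StandardDeviation => Math.Sqrt(Math.Max(0.0, Variance));
}
```

For the values 11, 16, 13, 12 the mean is 13 and the variance is 3.5; after removing 16 the mean is 12 and the variance is 2/3. This formula can lose precision when the values are large relative to their spread, so it is a sketch of the idea rather than a numerically robust implementation.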

烟雨扶苏 2024-08-15 18:52:26

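To consistently maintain the high/low, you store your data in sorted order. There are algorithms for maintaining data structures that preserve ordering.

The median is trivial if the data is ordered.

If the data is reduced slightly to a frequency table, you can maintain the mode. If you keep your data as a random, flat list of values, you can't easily compute the mode in the presence of change.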
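The frequency-table idea might look roughly like the following. `ModeTracker` is a hypothetical name; the sketch treats values as equal only when they are exactly equal, and on ties it reports just one of the most frequent values. Additions are O(1) on average, while removing the current mode triggers a scan of the table (proportional to the number of distinct values), so this is not a fully O(1) structure.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public sealed class ModeTracker
{
    private readonly Dictionary<double, int> frequency = new Dictionary<double, int>();
    private double mode;
    private int modeCount;

    public void Add(double value)
    {
        frequency.TryGetValue(value, out int n); // n is 0 when the value is new
        n++;
        frequency[value] = n;
        if (n > modeCount) { mode = value; modeCount = n; }
    }

    public bool Remove(double value)
    {
        if (!frequency.TryGetValue(value, out int n)) return false;
        if (n == 1) frequency.Remove(value);
        else frequency[value] = n - 1;
        // Only removing a copy of the current mode can change the answer.
        if (value.Equals(mode)) Rescan();
        return true;
    }

    // Walks the frequency table, not the raw data.
    private void Rescan()
    {
        modeCount = 0;
        foreach (var pair in frequency)
        {
            if (pair.Value > modeCount) { mode = pair.Key; modeCount = pair.Value; }
        }
    }

    public double Mode =>
        modeCount > 0 ? mode : throw new InvalidOperationException("No values yet.");
}
```

After adding 1, 2, 2, 3, 3, 3 the mode is 3; removing two of the 3s leaves 2 as the mode.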

伴我心暖 2024-08-15 18:52:26

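The answers to this question on online algorithms might be useful. Regarding usability for your needs, I'd say that while some online algorithms can be used to estimate summary statistics from partial data, others can be used to maintain them from a data flow just as you like.

You might also want to look at complex event processing (or CEP), which is used for tracking and analysing real-time data, for example in finance or web commerce. The only free CEP product I know of is Esper.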

恋竹姑娘 2024-08-15 18:52:26

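As Jason says, you are indeed describing an online algorithm. I've also seen this type of computation referred to as the Accumulator Pattern, whether the loop is implemented explicitly or by recursion.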

各自安好 2024-08-15 18:52:26

Not really a direct answer to your question, but for many statistics that are not online statistics you can usually find some rules to calculate by iteration only part of the time, and cache the correct value the rest of the time. Is this possibly good enough for you?

For the high value, for example:

public void Add(double value) {
    values.Add(value);
    if (value > highValue)
        highValue = value;
}

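public void Remove(double value) {
    values.Remove(value);
    // Rescan only when the removed value was (approximately) the cached maximum.
    // WithinTolerance and RecalculateHighValueByIteration are helpers assumed to be defined elsewhere.
    if (value.WithinTolerance(highValue))
        highValue = RecalculateHighValueByIteration();
}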

季末如歌 2024-08-15 18:52:26

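It's not possible to maintain the high or low with constant-time add and remove operations, because that would give you a linear-time sorting algorithm. You can use a search tree to maintain the data in sorted order, which gives you the minimum and maximum in logarithmic time. If you also keep subtree sizes and the count, finding the median is simple too.

If you only want to maintain the high or the low in the presence of additions and removals, look into priority queues, which are more efficient for that purpose than search trees.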
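A minimal sketch of the search-tree approach, using .NET's `SortedSet<double>` (a balanced tree) for the distinct values and a dictionary to count duplicates. This is a swapped-in illustration rather than the priority-queue variant mentioned above, and `MinMaxTracker` is a hypothetical name. Additions and removals cost O(log n), and so do the `Min` and `Max` lookups.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original post.
public sealed class MinMaxTracker
{
    private readonly SortedSet<double> distinct = new SortedSet<double>();
    private readonly Dictionary<double, int> copies = new Dictionary<double, int>();

    public void Add(double value)
    {
        copies.TryGetValue(value, out int n); // n is 0 when the value is new
        copies[value] = n + 1;
        if (n == 0) distinct.Add(value);
    }

    public bool Remove(double value)
    {
        if (!copies.TryGetValue(value, out int n)) return false;
        if (n == 1)
        {
            // Last copy: drop the value from the tree as well.
            copies.Remove(value);
            distinct.Remove(value);
        }
        else
        {
            copies[value] = n - 1;
        }
        return true;
    }

    public double Min =>
        distinct.Count > 0 ? distinct.Min : throw new InvalidOperationException("No values yet.");

    public double Max =>
        distinct.Count > 0 ? distinct.Max : throw new InvalidOperationException("No values yet.");
}
```

With 11, 16, 13, 12, 16 added, `Max` is 16; it stays 16 after one copy of 16 is removed and drops to 13 after the second.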

难忘№最初的完美 2024-08-15 18:52:26

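If you don't know the exact size of the dataset in advance, or if it is potentially unlimited, or if you just want some ideas, you should definitely look into streaming algorithms.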

陌生 2024-08-15 18:52:26

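It does sound (even after your second edit) like you are describing online algorithms, with the additional requirement that you want to allow "delete" operations. One example is the algorithms used for finding frequent items in a stream.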

About the author

梦在深巷
