Standard deviation in LINQ
Does LINQ model the aggregate SQL function STDDEV() (standard deviation)?
If not, what is the simplest / best-practices way to calculate it?
Example:
SELECT test_id, AVERAGE(result) avg, STDDEV(result) std
FROM tests
GROUP BY test_id
You can write your own extension method to calculate it.
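A minimal sketch of such an extension, assuming the population form (dividing by the count) and reusing the variable names ret, sum and count that this answer mentions. The class name StdDevExtensions and the choice to return 0 for fewer than two values are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StdDevExtensions
{
    // Population standard deviation of a sequence of doubles.
    // Returns 0 when there are fewer than two values (an assumption of this sketch).
    public static double StdDev(this IEnumerable<double> values)
    {
        double ret = 0;
        int count = values.Count();
        if (count > 1)
        {
            // Mean of the values (a further pass over the sequence).
            double avg = values.Average();

            // Sum of squared deviations from the mean (another pass).
            double sum = values.Sum(d => (d - avg) * (d - avg));

            // Divide by count because the data is treated as the whole population.
            ret = Math.Sqrt(sum / count);
        }
        return ret;
    }
}
```

Note that Count, Average and Sum each enumerate the sequence, so this version reads the data more than once.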
If you have a sample of the population rather than the whole population, then you should use ret = Math.Sqrt(sum / (count - 1)); instead.
Transformed into an extension from "Adding Standard Deviation to LINQ" by Chris Bennett.
Dynami's answer works, but it makes multiple passes through the data to get a result. This is a single-pass method that calculates the sample standard deviation:
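A sketch of a single-pass calculation using Welford's update, assuming an IEnumerable<double> input. The names WelfordExtensions and StandardDeviation, and returning 0 for fewer than two values, are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class WelfordExtensions
{
    // Sample standard deviation computed in a single pass (Welford's method).
    public static double StandardDeviation(this IEnumerable<double> values)
    {
        double mean = 0.0; // running mean
        double m2 = 0.0;   // running sum of squared deviations from the mean
        int n = 0;         // number of values seen so far

        foreach (double value in values)
        {
            n++;
            double delta = value - mean;
            mean += delta / n;
            // Uses the deviation from the old mean and from the updated mean.
            m2 += delta * (value - mean);
        }

        // Divide by n - 1 for the sample form; undefined for fewer than two values,
        // so this sketch returns 0 in that case.
        return n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0;
    }
}
```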
This is the sample standard deviation since it divides by n - 1. For the normal (population) standard deviation you need to divide by n instead.
This uses Welford's method, which has higher numerical accuracy than the Average(x^2)-Average(x)^2 method.
This converts David Clarke's answer into an extension that follows the same form as the other aggregate LINQ functions like Average.
Usage would be:
var stdev = data.StdDev(o => o.number)
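A possible shape for such an extension, assuming a selector that returns double, in the same style as Enumerable.Average. The class name SelectorStdDevExtensions is hypothetical, and the sketch keeps the sample form (n - 1) from the single-pass answer.

```csharp
using System;
using System.Collections.Generic;

public static class SelectorStdDevExtensions
{
    // Sample standard deviation of the values projected by selector,
    // computed in one pass with Welford's update.
    public static double StdDev<T>(this IEnumerable<T> source, Func<T, double> selector)
    {
        double mean = 0.0;
        double m2 = 0.0;
        int n = 0;

        foreach (T item in source)
        {
            double value = selector(item);
            n++;
            double delta = value - mean;
            mean += delta / n;
            m2 += delta * (value - mean);
        }

        // Returns 0 for fewer than two items (an assumption of this sketch).
        return n > 1 ? Math.Sqrt(m2 / (n - 1)) : 0.0;
    }
}
```

With this in scope, data.StdDev(o => o.number) works for any element type that exposes a numeric member named number.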
Straight to the point (and with C# > 6.0), Dynami's answer becomes this:
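A condensed sketch along those lines, assuming the population form and using the null-conditional operator introduced in C# 6.0. The class name ConciseStdDevExtensions is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConciseStdDevExtensions
{
    // Population standard deviation; a null or short sequence yields 0.
    public static double StdDev(this IEnumerable<double> values)
    {
        // ?. and ?? treat a null sequence as having zero elements.
        var count = values?.Count() ?? 0;
        if (count <= 1) return 0;

        var avg = values.Average();
        var sum = values.Sum(d => Math.Pow(d - avg, 2));
        return Math.Sqrt(sum / count);
    }
}
```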
Edit 2020-08-27:
Prompted by @David Clarke's comments, I ran some performance tests. These are the results:
I tested with a list of one million random doubles.
The original implementation had a runtime of ~48 ms.
The performance-optimized implementation took 2-3 ms.
So this is a significant improvement.
Some interesting details:
Getting rid of Math.Pow brings a boost of 33 ms!
List instead of IEnumerable: 6 ms
Calculating the average manually: 4 ms
For loops instead of foreach loops: 2 ms
Array instead of List brings an improvement of only ~2%, so I skipped this.
Using single instead of double brings nothing.
Lowering the code further and using goto (yes, GOTO... I haven't used this since 90s assembler...) instead of for loops does not pay off, thank goodness!
I also tested parallel calculation; this makes sense for lists of more than 200,000 items.
It seems that the hardware and software need to initialize a lot, and for small lists this is counterproductive.
All tests were executed twice in a row to get rid of the warm-up time.
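A sketch that combines the optimizations listed above (List instead of IEnumerable, a manual average, for loops, and multiplication instead of Math.Pow). It is an illustration of those points, not the measured code; the class name FastStdDev is hypothetical and the timings above are not claimed for it.

```csharp
using System;
using System.Collections.Generic;

public static class FastStdDev
{
    // Population standard deviation over a List<double>,
    // written with indexed for loops and no LINQ calls.
    public static double StdDev(this List<double> values)
    {
        if (values == null) return 0;
        int count = values.Count;
        if (count <= 1) return 0;

        // Manual average instead of Enumerable.Average.
        double total = 0;
        for (int i = 0; i < count; i++)
        {
            total += values[i];
        }
        double avg = total / count;

        // Squared deviation by multiplication instead of Math.Pow.
        double sum = 0;
        for (int i = 0; i < count; i++)
        {
            double d = values[i] - avg;
            sum += d * d;
        }

        return Math.Sqrt(sum / count);
    }
}
```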
Simple 4 lines. I used a List of doubles, but one could use IEnumerable<int> values.
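The four lines could look roughly like this, wrapped in a hypothetical helper (SimpleStats.StdDev) so the sketch compiles on its own. It assumes the population form; dividing by values.Count - 1 instead would give the sample form.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimpleStats
{
    // Population standard deviation of a non-empty list of doubles.
    public static double StdDev(List<double> values)
    {
        double avg = values.Average();
        double sumOfSquares = values.Sum(v => (v - avg) * (v - avg));
        double variance = sumOfSquares / values.Count;
        return Math.Sqrt(variance);
    }
}
```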
In the general case we want to compute StdDev in one pass: what if values is a file or an RDBMS cursor that can change between computing the average and computing the sum? We would get an inconsistent result. The code below uses just one pass:
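A one-pass sketch for the population StdDev, assuming the running-sums approach (sum and sum of squares accumulated together). The class name OnePassStdDev, the NaN result for an empty sequence and the clamp against small negative rounding errors are assumptions of this sketch. This formula can lose precision when the mean is large relative to the spread.

```csharp
using System;
using System.Collections.Generic;

public static class OnePassStdDev
{
    // Population standard deviation from a single enumeration of values.
    public static double StdDev(this IEnumerable<double> values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        double s = 0.0;  // running sum of values
        double s2 = 0.0; // running sum of squared values
        int n = 0;       // number of values

        foreach (double v in values)
        {
            n++;
            s += v;
            s2 += v * v;
        }

        if (n == 0) return double.NaN;

        double mean = s / n;
        // Mean of squares minus square of mean; clamp tiny negative rounding errors to 0.
        double variance = Math.Max(s2 / n - mean * mean, 0.0);
        return Math.Sqrt(variance);
    }
}
```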
The very same idea for sample StdDev:
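A sketch of the sample variant under the same assumptions: one enumeration, running sums, and division by n - 1. The names OnePassSampleStdDev and SampleStdDev are hypothetical, as is the NaN result for fewer than two values.

```csharp
using System;
using System.Collections.Generic;

public static class OnePassSampleStdDev
{
    // Sample standard deviation from a single enumeration of values.
    public static double SampleStdDev(this IEnumerable<double> values)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));

        double s = 0.0;  // running sum of values
        double s2 = 0.0; // running sum of squared values
        int n = 0;       // number of values

        foreach (double v in values)
        {
            n++;
            s += v;
            s2 += v * v;
        }

        // The sample form is undefined for fewer than two values.
        if (n < 2) return double.NaN;

        double mean = s / n;
        // Sum of squared deviations, divided by n - 1; clamp rounding errors to 0.
        double variance = Math.Max((s2 - n * mean * mean) / (n - 1), 0.0);
        return Math.Sqrt(variance);
    }
}
```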