Return the middle n from a collection (by value, not index)
I have a `List<int>` and I need to remove the outliers, so I want to use an approach where I only take the middle n. I want the middle in terms of values, not index.
For instance, given the following list, if I wanted the middle 80% I would expect that the 11 and 100 would be removed.
11,22,22,33,44,44,55,55,55,100.
Is there an easy / built in way to do this in LINQ?
Comments (6)
Removing outliers correctly depends entirely on the statistical model that accurately describes the distribution of the data -- which you have not supplied for us.
On the assumption that it is a normal (Gaussian) distribution, here's what you want to do.
First compute the mean. That's easy; it's just the sum divided by the number of items.
Second, compute the standard deviation. Standard deviation is a measure of how "spread out" the data is around the mean. Compute it by taking the difference of each item from the mean, squaring the differences, taking the mean of the squares, and taking the square root of that mean.
In a normal distribution 80% of the items are within about 1.28 standard deviations of the mean. So, for example, suppose the mean is 50 and the standard deviation is 20. You would expect that 80% of the sample would fall between 50 - 1.28 * 20 and 50 + 1.28 * 20, that is, between 24.4 and 75.6. You can then filter out items from the list that are outside of that range.
Note however that this is not removing "outliers". This is removing elements that are more than 1.28 standard deviations from the mean, in order to get an 80% interval around the mean. In a normal distribution one expects to see "outliers" on a regular basis. 99.73% of items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations from the mean! In fact, anywhere up to, say, five observations more than three standard deviations away from the mean when given a thousand observations probably does not indicate an outlier.
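The mean-and-standard-deviation filter described above could be sketched roughly as follows. This is only an illustration under the normality assumption; the class name `NormalFilter`, the method name `WithinStdDevs`, and the parameter `k` are made up for this example and do not come from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NormalFilter
{
    // Hypothetical helper: keeps the values that lie within `k` population
    // standard deviations of the mean. k = 1.28 corresponds to roughly the
    // central 80% of a normal distribution.
    public static List<int> WithinStdDevs(IReadOnlyList<int> values, double k)
    {
        if (values.Count == 0) return new List<int>();

        // Mean: the sum divided by the number of items.
        double mean = values.Average();

        // Standard deviation: square root of the mean squared difference.
        double stdDev = Math.Sqrt(values.Average(v => (v - mean) * (v - mean)));

        double lower = mean - k * stdDev;
        double upper = mean + k * stdDev;
        return values.Where(v => v >= lower && v <= upper).ToList();
    }
}
```

For the list in the question the mean is 44.1 and the standard deviation is about 23.8, so with `k = 1.28` the kept range is roughly 13.6 to 74.6, which happens to drop 11 and 100.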
I think you need to very carefully define what you mean by outlier and describe why you are attempting to eliminate them. Things that look like outliers are potentially not outliers at all, they are real data that you should be paying attention to.
Also, note that none of this analysis is correct if the normal distribution is incorrect! You can get into big, big trouble eliminating what look like outliers when in fact you've actually got the entire statistical model wrong. If the model is more "tail heavy" than the normal distribution then outliers are common, and not actually outliers. Be careful! If your distribution is not normal then you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.
You could use the `Enumerable.OrderBy` method to sort your list, then use the `Enumerable.Skip` and `Enumerable.Take` functions on the sorted sequence, where `nums` is your list of integers.
If you just want the "middle `n` values", skip the first `(nums.Count - n) / 2` sorted values and take the next `n`. However, when the result of `(nums.Count - n) / 2` is not an integer, how do you want the code to behave?
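A minimal sketch of the sort, skip, and take approach, assuming integer division is an acceptable answer to the rounding question (one extra value is then dropped from the high end when the split is uneven). The names `MiddleSelector` and `MiddleN` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiddleSelector
{
    // Hypothetical helper: returns the middle `n` values of `nums` by value.
    public static List<int> MiddleN(IEnumerable<int> nums, int n)
    {
        var sorted = nums.OrderBy(x => x).ToList();
        if (n < 0 || n > sorted.Count)
            throw new ArgumentOutOfRangeException(nameof(n));

        // Integer division rounds down, so an odd remainder leaves one
        // more value dropped from the top than from the bottom.
        int skip = (sorted.Count - n) / 2;
        return sorted.Skip(skip).Take(n).ToList();
    }
}
```

With the ten values from the question, `MiddleN(nums, 8)` skips one value and takes eight, so 11 and 100 are dropped.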
Assuming you're not doing any weighted average funny business:
You can then filter on Weight as needed. Drop the top/bottom n% as desired.
In your case:
Edit: As an extension method, because I like extension methods:
Usage:
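The code for this answer did not survive in this copy. The following is not the original code, only one guess at the idea: it assumes that `Weight` means each value's absolute distance from the plain (unweighted) average, and that the values with the largest weights are the ones to drop. The extension method name `ClosestToAverage` and its `keepFraction` parameter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OutlierExtensions
{
    // Hypothetical extension method: keeps the `keepFraction` share of the
    // values that lie closest to the average, returned in ascending order.
    public static IEnumerable<int> ClosestToAverage(
        this IEnumerable<int> source, double keepFraction)
    {
        var items = source.ToList();
        if (items.Count == 0) return Enumerable.Empty<int>();

        double average = items.Average();
        int keep = (int)Math.Round(items.Count * keepFraction);

        return items
            // Assumption: Weight is the distance from the average.
            .Select(value => new { Value = value, Weight = Math.Abs(value - average) })
            .OrderBy(x => x.Weight)
            .Take(keep)
            .Select(x => x.Value)
            .OrderBy(v => v);
    }
}
```

Under these assumptions, `nums.ClosestToAverage(0.8)` on the list from the question keeps the eight values nearest to 44.1, which again leaves out 11 and 100.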
Normally, if you wanted to exclude statistical outliers from a set of values, you'd compute the arithmetic mean and standard deviation for the set, and then remove values lying further from the mean than you'd like (measured in standard deviations). A normal distribution (your classic bell-shaped curve) exhibits the following properties: about 68% of values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
You can get Linq extension methods for computation of standard deviation (and other statistical functions) at http://www.codeproject.com/KB/linq/LinqStatistics.aspx
I am not going to question the validity of calculating outliers since I had a similar need to do exactly this kind of selection. The answer to the specific question of taking the middle n is:
This skips the first item, and stops before the last giving you just the middle n items. Here is a link to a .NET Fiddle demonstrating this query.
https://dotnetfiddle.net/p1z7em
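The query itself is missing from this copy. Going only by the description (skip the first item and stop before the last), a sketch might look like the following; the names `TrimEnds` and `DropMinAndMax` are hypothetical, and sorting first is an assumption taken from the question's "by value" requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimEnds
{
    // Hypothetical helper: sorts the values, then drops the single smallest
    // and the single largest item.
    public static List<int> DropMinAndMax(IEnumerable<int> nums)
    {
        var sorted = nums.OrderBy(x => x).ToList();

        // Skip the first item and take everything except the last one.
        return sorted.Skip(1).Take(Math.Max(0, sorted.Count - 2)).ToList();
    }
}
```

Note that this always removes exactly one item from each end, so it matches "the middle 80%" only for a list of ten items like the one in the question.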
If I understand correctly, we want to keep any values that fall into the middle 80% of the 11-100 range.
Assuming an ordered list, we can `SkipWhile` the values are lower than the `lowerBound`, and then `TakeWhile` the numbers are lower than the `upperBound`.
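A sketch of this range-based reading, assuming the bounds are obtained by trimming an equal margin from both ends of the min-max range (for 11-100 and 80%, that gives 19.9 and 91.1). The names `RangeTrim`, `MiddleOfRange`, and `fraction` are hypothetical, and treating the upper bound as inclusive is an assumption made so that a fraction of 1.0 keeps every value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeTrim
{
    // Hypothetical helper: keeps values inside the middle `fraction` of the
    // interval between the smallest and largest value.
    public static List<int> MiddleOfRange(IEnumerable<int> nums, double fraction)
    {
        var sorted = nums.OrderBy(x => x).ToList();
        if (sorted.Count == 0) return sorted;

        double min = sorted[0];
        double max = sorted[sorted.Count - 1];

        // Assumption: trim the same margin from both ends of the range.
        double margin = (max - min) * (1 - fraction) / 2;
        double lowerBound = min + margin;
        double upperBound = max - margin;

        return sorted
            .SkipWhile(x => x < lowerBound)
            .TakeWhile(x => x <= upperBound)
            .ToList();
    }
}
```

Unlike the count-based approaches, this can drop any number of values (or none), since it depends on where the values fall rather than on how many there are.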