Simple data processing

Posted on 2024-10-16 08:52:12

Suppose I have this set of data. After sorting, the values can be plotted as shown below.

M=[-99  -99 -44.5   -7.375  -5.5    -1.666666667    -1.333333333    -1.285714286    0.436363636 2.35    3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4    19.2    19.6    20.75   24.25   34.5    49.5]

plot for the data

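My question is: how do I find the values that lie in the middle range and record their indices? Should I use a normal distribution, or something else? I appreciate your help!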

Picture for Jonas (image omitted)



雪落纷纷 2024-10-23 08:52:12

Assuming your mid range is [-10 10], the indices would be:

> find(-10< M & M< 10)
ans =

    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18

Note that you can also access the values themselves by logical indexing, for example:

> M(-10< M & M< 10)
ans =

 Columns 1 through 15:

  -7.37500  -5.50000  -1.66667  -1.33333  and so on ...
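
To make the relation between the two forms explicit, here is a minimal sketch. The variable names mask, idx and vals are only illustrative. The logical mask selects the values directly, while find turns the same mask into positions.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 ...
     19.6 20.75 24.25 34.5 49.5];

% Logical mask: true where the value lies strictly between -10 and 10
mask = (-10 < M) & (M < 10);

idx  = find(mask);   % positions of the true entries
vals = M(mask);      % the values themselves, via logical indexing

% Indexing with the positions gives the same values as indexing with the mask
disp(isequal(M(idx), vals))
```

Keeping the mask around is convenient when the same selection has to be applied to several vectors of the same length.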

To derive the mid range from the data itself, for example as the range between the 25% and 75% quantiles, just:

> q= quantile(M(:), [.25 .75])
q =

   -1.3214
   17.0917

> find(q(1)< M & M< q(2))
ans =

    8    9   10   11   12   13   14   15   16   17   18   19   20

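Note also that M(:) is used here to ensure that quantile treats M as a vector. If you adopt the convention that all vectors in your programs are column vectors, most functions will automatically treat them correctly.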

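Update:
A very short description of quantiles: they are points taken from the inverse of the cumulative distribution function (cdf) of a random variable. (Your sorted M can be read as an empirical version of that inverse cdf, since it is nondecreasing: the value at position k is roughly the k/n quantile.) Put simply, the 0.5 quantile of your data is the value below which 50% of the data lie. More details on quantiles can be found, for example, here.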
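
As an illustration of what the quantile call computes, the quartiles can also be worked out by hand. The sketch below assumes the convention that the k-th sorted value is the (k-0.5)/n quantile, with linear interpolation in between. Under that assumption it reproduces the q values shown above, but other software may use a different convention. The names x, p and pos are only illustrative.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 ...
     19.6 20.75 24.25 34.5 49.5];

x = sort(M(:));        % sorted data as a column vector
n = numel(x);
p = [0.25; 0.75];      % requested quantile levels

% Assumed convention: the k-th sorted value is the (k-0.5)/n quantile,
% so level p sits at fractional position n*p + 0.5 in the sorted data.
pos = min(max(n*p + 0.5, 1), n);   % clamp to the valid index range

% Linear interpolation between the two neighbouring sorted values
q = interp1((1:n)', x, pos);

% Indices of the values strictly inside the middle range
idx = find(q(1) < M & M < q(2));
```

With 27 values the two levels fall at positions 7.25 and 20.75, so each quartile is interpolated between two neighbouring data points rather than being a data point itself.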


乜一 2024-10-23 08:52:12

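If you don't know a priori what your middle range is, but you know that you want to discard the outliers at both the start and the end of your curve, and if you have the Statistics Toolbox, you can fit a robust linear regression to your data using ROBUSTFIT and keep only the inliers.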

M=[-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 20.75 24.25 34.5 49.5];

%# robust linear regression
x = find(isfinite(M)); %# eliminate NaN or Inf
[u,s]=robustfit(x,M(x));

%# inliers have a weight > 0.25 (raise this value to be stricter)
inlierIdx = s.w > 0.25;
middleRangeX = x(inlierIdx)
middleRangeValues = M(x(inlierIdx))

%# plot the data in blue, the regression in red and the good values in green
plot(x,M(x),'-b.',x,u(1)+u(2)*x,'r')
hold on,plot(middleRangeX,middleRangeValues,'*g')

middleRangeX =
  Columns 1 through 21
     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24
  Column 22
    25
middleRangeValues =
  Columns 1 through 10
       -7.375         -5.5      -1.6667      -1.3333      -1.2857      0.43636         2.35          3.3       4.2857       5.0526
  Columns 11 through 20
          6.2       7.0769       7.2308       7.9167          9.7       10.667       16.167         17.4         19.2         19.6
  Columns 21 through 22
        20.75        24.25
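
If the Statistics Toolbox is not available, a different and much simpler technique can serve as a rough substitute: keep only the values that lie close to the median, measured in units of the median absolute deviation (MAD). This is not what ROBUSTFIT does, and it ignores the ordering of the points. The cutoff of 3 scaled MADs is an assumed, adjustable choice, and the names med, madM and cutoff are only illustrative.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 ...
     19.6 20.75 24.25 34.5 49.5];

med  = median(M);               % robust estimate of the centre
madM = median(abs(M - med));    % median absolute deviation from the centre

% 1.4826 rescales the MAD so that it matches the standard deviation for
% normally distributed data; the factor 3 is an assumed cutoff
cutoff = 3 * 1.4826 * madM;

inlierIdx    = find(abs(M - med) <= cutoff);   % indices of the kept values
inlierValues = M(inlierIdx);                   % the kept values themselves
```

For the data above this keeps indices 4 through 26, so compared with the ROBUSTFIT result it also keeps the value 34.5. Lowering the cutoff makes the selection stricter.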
