Finding appropriate cut-off values
I am trying to implement Hampel tanh estimators to normalize highly asymmetric data. To do this, I need to perform the following calculation:
Given x, a sorted list of numbers, and m, the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). We know nothing about the distribution of values in x. I am writing in Python using numpy, and the best idea I have had is to write some sort of stochastic iterative search (for example, as described by Solis and Wets), but I suspect there is a better approach, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but couldn't find any useful hint.
EDIT
Seth suggested using scipy.stats.mstats.trimboth; however, in my test on a skewed distribution, this suggestion didn't work:
from scipy.stats.mstats import trimboth
import numpy as np
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)
trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5
# check what fraction of the elements falls into the range (m-a; m+a)
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))
print(np.sum(sel) / float(len(theList)))
The output is 0.79 (about 80%, instead of 70%).
Comments (3)
You first need to symmetrize your distribution by folding all values less than the median over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution. This gives the result of 0.7, as required.
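The code that accompanied this answer is missing from this copy. As a rough sketch of the folding idea (not the answer's own code): folding about the median m turns every value into its distance |x - m| from m, and the 70th percentile of those distances is then a. The helper name `halfwidth_by_folding` is made up for illustration, and `np.percentile` is used here in place of the scipy.stats function the answer refers to.

```python
import numpy as np

def halfwidth_by_folding(x, fraction=0.70):
    """Return (m, a) so that about `fraction` of x lies in (m - a, m + a).

    Hypothetical helper for illustration; not taken from the original answer.
    """
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    # Folding the values left of m over to the right is the same as
    # replacing every value by its distance from m.
    distances = np.abs(x - m)
    # The requested percentile of the distances is the half-width a.
    a = np.percentile(distances, 100.0 * fraction)
    return m, a

# Same skewed test data as in the question.
theList = np.log10(1 + np.arange(.1, 100))
m, a = halfwidth_by_folding(theList)
coverage = np.mean((theList > m - a) & (theList < m + a))
print(coverage)
```

Because only the distances from m are used, this needs no assumption about the shape of the distribution, and x does not even have to be sorted.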
Restate the problem slightly. You know the length of the list and what fraction of the numbers in the list to consider. Given that, you can determine the difference between the first and last indices in the list that give you the desired range. The goal then is to find the indices that minimize a cost function corresponding to the desired symmetric values about the median.
Let the smaller index be n1 and the larger index be n2; these are not independent. The values from the list at the indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible. This occurs when (b - c)**2 is minimal. That's pretty easy using numpy.argmin. Paralleling the example in the question, here's an interactive session illustrating the approach:
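The interactive session itself is not included in this copy. The sketch below is one possible reading of the description, not the author's session: slide a window holding the required number of points over the sorted list, compute b and c for each position, and pick the position where (b - c)**2 is smallest. The function name `halfwidth_by_window` and the choice of a as the average of b and c are assumptions made for illustration.

```python
import numpy as np

def halfwidth_by_window(x, fraction=0.70):
    """Return (m, a, n1, n2) for the most symmetric window about the median.

    Hypothetical helper for illustration; not taken from the original answer.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    m = np.median(x)
    # Number of points the window must contain.
    k = max(1, int(round(fraction * n)))
    # n2 is tied to n1: the two indices are not independent.
    n1 = np.arange(0, n - k + 1)
    n2 = n1 + k - 1
    b = m - x[n1]          # distance from the lower edge up to the median
    c = x[n2] - m          # distance from the median up to the upper edge
    best = np.argmin((b - c) ** 2)
    # Assumption: take a as the average of the two (nearly equal) distances.
    a = 0.5 * (b[best] + c[best])
    return m, a, n1[best], n2[best]

# Same skewed test data as in the question.
theList = np.log10(1 + np.arange(.1, 100))
m, a, n1, n2 = halfwidth_by_window(theList)
coverage = np.mean((theList > m - a) & (theList < m + a))
print(n1, n2, coverage)
```

Since b and c are rarely exactly equal, the fraction inside (m-a; m+a) can be off by a point or two from the target, which should be acceptable when only "approximately 70%" is needed.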
What you want is scipy.stats.mstats.trimboth. Set proportiontocut=0.15. After trimming, take (max-min)/2.
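The question's edit reports about 79% coverage with this recipe on skewed data. One way to see why, sketched under the assumption that trimming 15% from each tail is roughly the same as keeping the values between the 15th and 85th percentiles: that range does hold about 70% of the data, but its midpoint coincides with the median only when the distribution is symmetric. Re-centering the same half-width on the median therefore covers a different fraction. The variable names below are illustrative.

```python
import numpy as np

# Same skewed test data as in the question.
theList = np.log10(1 + np.arange(.1, 100))
m = np.median(theList)

# Central 70% by rank: the values between the 15th and 85th percentiles.
lo, hi = np.percentile(theList, [15, 85])
rank_coverage = np.mean((theList > lo) & (theList < hi))

# Half-width of that range, re-centered on the median as the recipe does.
a = 0.5 * (hi - lo)
median_coverage = np.mean((theList > m - a) & (theList < m + a))

# For skewed data the midpoint of the trimmed range is not the median.
midpoint = 0.5 * (lo + hi)
print(rank_coverage, median_coverage, midpoint, m)
```

For symmetric data the midpoint and the median would agree and the two coverage figures would match, which is presumably the case the recipe had in mind.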