Finding a suitable cutoff value

Posted on 2024-10-20 15:48:56

I am trying to implement Hampel tanh estimators to normalize highly asymmetric data. To do this, I need to perform the following calculation:

Given x, a sorted list of numbers, and m, the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). We know nothing about the distribution of the values in x. I am writing in Python using numpy, and the best idea I have had is to write some sort of stochastic iterative search (for example, as described by Solis and Wets), but I suspect there is a better approach, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but couldn't find any useful hints.

EDIT

Seth suggested using scipy.stats.mstats.trimboth; however, in my test on a skewed distribution, this suggestion didn't work:

from scipy.stats.mstats import trimboth
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5

#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

The output is 0.79 (about 80%, instead of the desired 70%).
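One way to see why the trimmed range misses the target is sketched below. This is an illustration under stated assumptions, not part of the original test: it uses np.percentile as a stand-in for the trimming step, and the variable names are made up for the example. Trimming 15% from each end keeps the central 70% of the sorted values, but for skewed data that span is not centred on the median, so half its width is not the a defined above.

```python
import numpy as np

the_list = np.log10(1 + np.arange(.1, 100))
median = np.median(the_list)

# Approximate edges of the central 70% of the data (15th and 85th percentiles).
lower_edge, upper_edge = np.percentile(the_list, [15, 85])

# Distance from the median to each edge; these differ for a skewed sample.
lower_gap = median - lower_edge
upper_gap = upper_edge - median

print(lower_gap, upper_gap)
```

For this sample the lower gap is clearly larger than the upper gap, so an interval of half the trimmed width placed symmetrically around the median captures a different share of the data than the trimmed range itself.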

把昨日还给我 2024-10-27 15:48:56

You first need to symmetrize your distribution by folding all values less than the median over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution:

from scipy.stats import scoreatpercentile
import numpy as np

theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)

oneSidedList = theList.copy()           # real copy; theList[:] would only be a view of the same data
# fold over to the right all values left of the median
oneSidedList[theList < theMedian] = 2*theMedian - theList[theList < theMedian]

# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian

#check how many elements fall into the range
sel = (theList > (theMedian - a)) * (theList < (theMedian + a))

print(np.sum(sel) / float(len(theList)))

This gives the required result of 0.7.
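The folding step is equivalent to taking absolute deviations from the median, so the same idea can be sketched more compactly. The following is a minimal sketch, assuming np.percentile is an acceptable substitute for scoreatpercentile; the function name symmetric_cutoff and its parameters are hypothetical and not from the answer above.

```python
import numpy as np

def symmetric_cutoff(values, fraction=0.7):
    """Return (median, a) so that roughly `fraction` of the values
    lie within (median - a, median + a)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Folding values over the median is the same as taking |x - median|.
    deviations = np.abs(values - median)
    # The cutoff a is the requested percentile of those deviations.
    a = np.percentile(deviations, 100.0 * fraction)
    return median, a

the_list = np.log10(1 + np.arange(.1, 100))
median, a = symmetric_cutoff(the_list)

# Check what share of the data falls strictly inside the interval.
coverage = np.mean(np.abs(the_list - median) < a)
print(coverage)
```

Because the input array is never modified, this version avoids the copy-versus-view question entirely. The coverage is only approximately equal to the requested fraction, since the percentile is interpolated between neighbouring data points.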

打小就很酷 2024-10-27 15:48:56

Restate the problem slightly. You know the length of the list and what fraction of its numbers to include. Given that, you can determine the difference between the first and last indices of the window that covers the desired fraction. The goal then is to find the pair of indices that minimizes a cost function measuring how far the two window edges are from being symmetric about the median.

Let the smaller index be n1 and the larger index be n2; these are not independent, since n2 is fixed once n1 is chosen. The values from the list at these indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible. This occurs when (b - c)**2 is minimal, which is easy to find using numpy.argmin. Paralleling the example in the question, here's an interactive session illustrating the approach:

$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
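The session above hard-codes the 30-element head and tail for a 100-element list. A generalized version might look like the following minimal sketch. The function name window_cutoff, the rounding of the window width, and the return values are assumptions made for illustration, and the sketch assumes the input is already sorted and the fraction is large enough that every candidate window contains the median.

```python
import numpy as np

def window_cutoff(sorted_values, fraction=0.7):
    """Slide a window holding about `fraction` of the values over the
    sorted data and pick the position whose edges are most nearly
    symmetric about the median. Returns (n1, n2, b, c)."""
    x = np.asarray(sorted_values, dtype=float)
    n = x.size
    width = int(round(fraction * n))      # number of values inside the window
    median = np.median(x)

    starts = np.arange(n - width + 1)     # every candidate n1
    b = median - x[starts]                # distance from lower edge to median
    c = x[starts + width - 1] - median    # distance from median to upper edge

    best = np.argmin((b - c) ** 2)        # most symmetric window
    n1 = int(starts[best])
    return n1, n1 + width - 1, b[best], c[best]

the_list = np.log10(1 + np.arange(.1, 100))
n1, n2, b, c = window_cutoff(the_list)
print(n1, n2, b, c)
```

Here the window contains exactly `width` values by construction, while b and c are only approximately equal, so either of them (or their average) could serve as the cutoff a. The window in this sketch is one element narrower than the one implied by the head and tail slices in the session, so the selected index may differ slightly.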
