Median of medians - is this possible, or is there a different way?
Currently I am aggregating a large amount of data on a daily basis, and for each day I calculate the median of that day's values. Now I need to aggregate all these daily results on a monthly basis, and of course I need to calculate the median again.
Is there a way to calculate a median of medians and have it be statistically correct? I want to avoid using the raw data again, because there is a huge amount of it :)
As a small proof of concept I wrote this JavaScript; maybe it helps to find a way:
var aSortedNumberGroups = [];
var aSortedNumbers = [];
var aMedians = [];

// Ascending numeric comparator for Array.prototype.sort.
function compareNumbers(iNumA, iNumB) {
    return iNumA - iNumB;
}

// Median of an array that is already sorted in ascending order.
Math.median = function (aData) {
    var fMedian = 0;
    var iIndex = Math.floor(aData.length / 2);
    if (!(aData.length % 2)) {
        // Even length: average the two middle values.
        fMedian = (aData[iIndex - 1] + aData[iIndex]) / 2;
    } else {
        fMedian = aData[iIndex];
    }
    return fMedian;
};

// Build 5 groups of 1000 random integers in the range 0..10000.
for (var iCurrGroupNum = 0; iCurrGroupNum < 5; ++iCurrGroupNum) {
    var aCurrNums = [];
    for (var iCurrNum = 0; iCurrNum < 1000; ++iCurrNum) {
        var iCurrRandomNumber = Math.floor(Math.random() * 10001);
        aCurrNums.push(iCurrRandomNumber);
        aSortedNumbers.push(iCurrRandomNumber);
    }
    aCurrNums.sort(compareNumbers);
    aSortedNumberGroups.push(aCurrNums);
    aMedians.push(Math.median(aCurrNums));
}

// Math.median expects sorted input, so sort the medians and the combined list too.
aMedians.sort(compareNumbers);
aSortedNumbers.sort(compareNumbers);

console.log("Medians of each group: " + JSON.stringify(aMedians, null, 4));
console.log("Median of medians: " + Math.median(aMedians));
console.log("Median of all: " + Math.median(aSortedNumbers));
As you will see, there is often a large gap between the median of all raw numbers and the median of medians, and I would like the two to be very close to each other.
Thanks a lot!
Comments (4)
You don't actually "calculate" a median; you "discover" it by redistributing the data into subsets. The only optimization for this is a reloadable "tick chart" or running tally: store each distinct value together with the number of times it occurred. That way you can recreate the distribution without having to reparse the raw data. This is only a small optimization, but depending on how much repetition the data set contains, you could save yourself tons of MB and, at the very least, a bunch of processor cycles.
Think of it in JSON:
{ "1": 3, "5": 12, "7": 4 }
Read it as: "1" has occurred 3 times, "5" has occurred 12 times, and so on. Then persist those counts, starting at the beginning of the time period for which you want a median.
Hope this helps -ck
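A minimal sketch of how such tallies could be merged and used to get the exact median for a longer period. The helper names (tally, mergeTallies, medianFromTally) are made up for illustration and are not from ck's answer; the sketch also assumes the values are numbers that work as object keys.

```javascript
// Hypothetical helpers; the names are illustrative, not from the thread.

// Build a value -> count tally for one day's raw values.
function tally(values) {
    var counts = {};
    values.forEach(function (v) {
        counts[v] = (counts[v] || 0) + 1;
    });
    return counts;
}

// Merge several daily tallies into one tally for the whole period.
function mergeTallies(tallies) {
    var merged = {};
    tallies.forEach(function (t) {
        Object.keys(t).forEach(function (key) {
            merged[key] = (merged[key] || 0) + t[key];
        });
    });
    return merged;
}

// Exact median from a tally: walk the distinct values in ascending order
// until the cumulative count passes the middle position(s).
function medianFromTally(counts) {
    var values = Object.keys(counts).map(Number).sort(function (a, b) {
        return a - b;
    });
    var total = values.reduce(function (sum, v) { return sum + counts[v]; }, 0);
    if (total === 0) { return NaN; }
    var lowerRank = Math.floor((total - 1) / 2); // 0-based rank of the lower middle
    var upperRank = Math.floor(total / 2);       // 0-based rank of the upper middle
    var seen = 0, lower = null, upper = null;
    for (var i = 0; i < values.length; i++) {
        seen += counts[values[i]];
        if (lower === null && seen > lowerRank) { lower = values[i]; }
        if (seen > upperRank) { upper = values[i]; break; }
    }
    return (lower + upper) / 2;
}

// Usage: one tally per day, merged at the end of the month.
var monthly = mergeTallies([tally([1, 5, 5]), tally([5, 7, 7, 1])]);
console.log(medianFromTally(monthly));
```

The storage cost depends on the number of distinct values rather than the number of raw records, which is where the saving comes from when the data repeats a lot.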
No, unfortunately there is no way to calculate the median of the whole from the medians of its subsets and still be statistically accurate. If you wanted to calculate the mean, however, you could use the means of the subsets, provided they are of equal size.
ck's optimization above could be of assistance to you.
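A small illustration of the point, using made-up sample data. With the three groups below, the median of the group medians is 2 while the true median of all nine values is 3, whereas the mean of the group means agrees with the overall mean (up to floating-point rounding). The combinedMean helper is an addition for illustration: weighting each group mean by its group size gives the overall mean even when the groups are not of equal size.

```javascript
// Illustrative helpers; names and sample data are hypothetical.
function median(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function mean(values) {
    return values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
}

// Combine per-group means into the overall mean, weighting by group size.
function combinedMean(means, sizes) {
    var total = 0, count = 0;
    for (var i = 0; i < means.length; i++) {
        total += means[i] * sizes[i];
        count += sizes[i];
    }
    return total / count;
}

var groups = [[1, 2, 3], [1, 2, 3], [4, 100, 200]];
var all = [].concat.apply([], groups);

var medianOfMedians = median(groups.map(median)); // group medians are 2, 2, 100
var trueMedian = median(all);
var meanOfMeans = mean(groups.map(mean));
var trueMean = mean(all);

console.log("Median of medians:", medianOfMedians, "true median:", trueMedian);
console.log("Mean of means:", meanOfMeans, "true mean:", trueMean);
```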
I know this is a very dated thread, but future readers may find Tukey's Ninther method quite relevant ... analysis here: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/
-kg
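For readers who have not met it: as commonly described, the ninther takes nine values, splits them into three groups of three, takes the median of each group, and then takes the median of those three medians. The sketch below is only an illustration of that idea with made-up function names; it gives an estimate of the median, not the exact value.

```javascript
// Median of three numbers without sorting.
function medianOfThree(a, b, c) {
    return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
}

// Ninther of exactly nine values: median of the three group medians.
// Hypothetical helper; it estimates the median and is not guaranteed to be exact.
function ninther(values) {
    if (values.length !== 9) {
        throw new Error("ninther expects exactly 9 values");
    }
    return medianOfThree(
        medianOfThree(values[0], values[1], values[2]),
        medianOfThree(values[3], values[4], values[5]),
        medianOfThree(values[6], values[7], values[8])
    );
}

var sample = [3, 1, 4, 1, 5, 9, 2, 6, 5];
console.log("Ninther estimate:", ninther(sample));
```

For this sample the estimate is 5, while the exact median of the nine values is 4, so it should be treated as an approximation.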
Yet another approach is to take each day's data, parse it, and store it in sorted order. For a given day you can just read off the middle element of the sorted data and you've got your answer.
At the end of the month you can do a quick-select to find the median. You can take advantage of the sorted order of each day's data to do a binary search to split it. The result is that your end of month processing will be very, very quick.
The same kind of data, organized in the same kind of way, will also let you do various percentiles very cheaply. The only hard part is extracting each day's raw data and sorting it.
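A rough sketch of the month-end step, under the assumption that the values are integers and each day's array is already sorted ascending. Note that this is not the quick-select described above: it swaps in a simpler variant that binary-searches over the value range and uses a binary search inside each day's array to count how many elements are at most the candidate value. All function names are made up for illustration.

```javascript
// Number of elements <= x in an ascending-sorted array (binary search).
function countAtMost(sorted, x) {
    var lo = 0, hi = sorted.length;
    while (lo < hi) {
        var mid = (lo + hi) >>> 1;
        if (sorted[mid] <= x) { lo = mid + 1; } else { hi = mid; }
    }
    return lo;
}

// k-th smallest value (0-based) across several sorted integer arrays.
// Assumes 0 <= k < total number of elements.
function kthSmallest(sortedDays, k) {
    var lo = Infinity, hi = -Infinity;
    sortedDays.forEach(function (day) {
        if (day.length) {
            lo = Math.min(lo, day[0]);
            hi = Math.max(hi, day[day.length - 1]);
        }
    });
    // Find the smallest value x such that more than k elements are <= x.
    while (lo < hi) {
        var mid = Math.floor((lo + hi) / 2);
        var count = 0;
        for (var i = 0; i < sortedDays.length; i++) {
            count += countAtMost(sortedDays[i], mid);
        }
        if (count > k) { hi = mid; } else { lo = mid + 1; }
    }
    return lo;
}

// Exact median over all days without merging or re-sorting the data.
function monthlyMedian(sortedDays) {
    var total = sortedDays.reduce(function (n, day) { return n + day.length; }, 0);
    if (total === 0) { return NaN; }
    var lower = kthSmallest(sortedDays, Math.floor((total - 1) / 2));
    var upper = kthSmallest(sortedDays, Math.floor(total / 2));
    return (lower + upper) / 2;
}

console.log(monthlyMedian([[1, 3, 5], [2, 4, 6], [7, 8, 9]]));
```

Passing a different rank to kthSmallest gives other percentiles from the same stored data.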