Calculating percentiles in Excel from "bucket" data rather than the data list itself
I have a bunch of data in Excel that I need to get certain percentile information from. The problem is that instead of a data set listing every individual value, I only have the number of occurrences of each value, i.e. "bucket" data.
For example, imagine that my actual data set looks like this: 1,1,2,2,2,2,3,3,4,4,4
The data set that I have is this:
Value    No. of occurrences
1        2
2        4
3        2
4        3
Is there an easy way for me to calculate percentile information (as well as the median) without having to explode the summary data out to the full data set? (Once I did that, I know that I could just use the Percentile(A1:A5, p) function.)
This is important because my data set is very large. If I exploded the data out, I would have hundreds of thousands of rows, and I would have to do it for a couple of hundred data sets.
Help!
Comments (1)
The median in your example is quite easy to find, as you are showing an odd total No. of occurrences. By observation, the median is 2. Relative to the last 2 (the sixth value in the sequence) there are five values less than or equal to it [1,1,2,2,2] and five values greater than or equal to it [3,3,4,4,4].
The position of that middle value can be calculated from your summary data with a formula such as
=(1+SUM(No._of_occurrences))/2
where No._of_occurrences is the named range containing the array of your No. of occurrences [2,4,2,3]. For your data this returns 6, and the sixth value in sorted order is 2.
A data set with an even number of data points has no single middle value, so any result after adding one data point (say another 4) is suspect. The formula would return 6.5 in that case, with the half indicating that there are two middle values rather than one. If you take the fairly conventional approach of averaging those two values, the formula result can be interpreted as pointing to the mean of the sixth [2] and seventh [3] values, i.e. 2.5.
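For anyone who wants to check the logic outside Excel, here is a minimal Python sketch of the same idea. It is an illustration, not an Excel feature: the function name `median_from_counts` is made up for this example, and it assumes the values are listed in ascending order with one count per value. It computes the same position as the formula above, then walks the running totals of the counts to find which value sits at that position, averaging the two middle values when the total is even.

```python
from bisect import bisect_left
from itertools import accumulate


def median_from_counts(values, counts):
    """Return (position, median) for data summarised as value/count pairs.

    Assumes `values` is sorted ascending and `counts` lines up with it.
    """
    cumulative = list(accumulate(counts))  # running totals of the counts
    total = cumulative[-1]
    position = (1 + total) / 2  # same as =(1+SUM(No._of_occurrences))/2

    def value_at(rank):
        # First bucket whose running total reaches the 1-based rank.
        return values[bisect_left(cumulative, rank)]

    if position == int(position):
        # Odd total: there is a single middle value.
        return position, value_at(int(position))

    # Even total: average the two middle values (a common convention).
    lower = int(position)
    return position, (value_at(lower) + value_at(lower + 1)) / 2


if __name__ == "__main__":
    print(median_from_counts([1, 2, 3, 4], [2, 4, 2, 3]))
    print(median_from_counts([1, 2, 3, 4], [2, 4, 2, 4]))
```

The second call adds one more 4 to the data, which is the even-total case discussed above.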
Multiplying each bin's No. of occurrences by 100 and dividing by the total No. of occurrences [11] gives the percentage each bin contributes to the total. A cumulative total of these gives the percentile at the upper limit of each bin; here that is roughly 18.2, 54.5, 72.7 and 100. Take, say, the 30th percentile: it falls in the second bin, so in this case it is 2. The 20th and the 50th percentile (the median) fall in the same bin, so for them the answer is also 2.
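The cumulative-percentage lookup can be sketched in Python as well. Again this is only an illustration under stated assumptions: `percentile_from_counts` is a hypothetical helper name, the values must be in ascending order, and the rule used is "first bin whose cumulative percentage reaches p". That rule can give a different answer from Excel's Percentile function, which interpolates between neighbouring values.

```python
def percentile_from_counts(values, counts, p):
    """Return the value of the first bin whose cumulative percentage reaches p.

    Assumes `values` is sorted ascending, `counts` lines up with it,
    and 0 < p <= 100.
    """
    total = sum(counts)
    running = 0
    for value, count in zip(values, counts):
        running += count
        # Compare running/total*100 >= p without dividing,
        # so whole-number inputs avoid rounding problems.
        if running * 100 >= p * total:
            return value
    raise ValueError("p must be between 0 and 100")


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    counts = [2, 4, 2, 3]
    for p in (20, 30, 50):
        print(p, percentile_from_counts(values, counts, p))
```

With several hundred summarised data sets, the same function can simply be called once per set, with no need to expand any of them into individual rows.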
This works because you chose one bin per distinct data value. Had the bins been ranges, as is more usual (say 1-5, 6-10 etc.), the 20th and the 50th percentile might still have fallen in the same bin but would not necessarily have had the same value. Even then, only the contents of that one bin would need further examination to determine the exact value, rather than your entire data set.