R histogram results in an empty graph
I'm a beginner R programmer attempting to plot a histogram of an insurance claims dataset with 100,000+ observations which is heavily skewed (mean=$61,000, median=$20,000, max value=$15M).
I've submitted the following code to graph the adj_unl_claim variable over the $0-$100,000 domain:
hist(test$adj_unl_claim, freq=FALSE, ylim=c(0,1), xlim=c(0,100000),
prob=TRUE, breaks=10, col='red')
with the result being an empty graph with axes but no histogram bars - just an empty graph.
I suspect the problem is related to the skewed nature of my data, but I've tried every combination of breaks and xlim and nothing works. Any solutions are much appreciated!
Comments (4)
If you've set freq = FALSE, then you are getting a histogram of probability densities. These are likely much less than 1: the bar areas must sum to 1, so with bins thousands of dollars wide the heights can be at most about 0.001. Consequently, with ylim = c(0, 1) your histogram bars are probably drawn so short that they vanish against the x-axis. Try again without setting ylim, and R will automatically calculate reasonable y-axis limits.
Note also that setting the xlim doesn't change the actual plot, just how much of it you see. So you might not actually see 10 breaks, if some of them fall beyond the 100000 limit in your plot. You might actually want to subset your data to exclude values over 100000 first, and then do a histogram on the reduced dataset to get the plot you want. Maybe, I'm not sure what your objective is here.
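A minimal sketch of the two points above, using simulated lognormal data as a stand-in for the real claims. The `claims` vector and its parameters are assumptions chosen to resemble the figures in the question, not the asker's data. It shows how small the density heights are on a dollar scale, then draws the histogram on a subset with the y-axis left to R:

```r
# Simulated stand-in for test$adj_unl_claim; the distribution is an assumption.
set.seed(1)
claims <- rlnorm(100000, meanlog = 10, sdlog = 1.5)

# Subset first, so that the breaks are computed on the range of interest.
small <- claims[claims <= 100000]

# Compute the histogram without drawing it, to inspect the density heights.
h <- hist(small, breaks = 10, plot = FALSE)
max(h$density)  # far below 1, because each bin is thousands of dollars wide

# Draw it with no ylim, so R picks y-axis limits that fit the bars.
hist(small, freq = FALSE, breaks = 10, col = "red",
     main = "Claims up to $100,000", xlab = "Claim size ($)")
```

Because the bar areas sum to 1, no bar can be taller than 1 divided by the bin width, which is why a y-axis running to 1 leaves nothing visible.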
This might give you something to play with, using some of Tyler's suggestions.
[code block missing from the source]
where hist produced something like
[figure missing from the source]
The problem with this is that although the histogram covers about 82% of the claims in this pseudo-data, it only covers about 31% of the value of the claims. So unless the only point you want to make is that most claims are small, you might want to consider a different graph.
My guess is that the real point from your data is that while most claims are fairly small, most of the cost is in the big claims. The big claims will not show up in a histogram, even if you extend the scale. Instead break the claims up into groups of differing widths, including for example 0-$1000 and $1M+, and show with a dot plot (a) what proportion of claims fall into each group and (b) what proportion of the values of claims fall into each group.
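The grouping idea could be sketched roughly as follows. This is not the answerer's code: the simulated `claims` vector, the cut points, and the group labels are all illustrative assumptions. It computes (a) the share of claim counts and (b) the share of claim value in each group and shows both in one dot plot:

```r
# Simulated stand-in for the claims data; the distribution is an assumption.
set.seed(1)
claims <- rlnorm(100000, meanlog = 10, sdlog = 1.5)

# Groups of differing widths; these cut points are illustrative only.
breaks <- c(0, 1e3, 1e4, 1e5, 1e6, Inf)
labels <- c("$0-$1k", "$1k-$10k", "$10k-$100k", "$100k-$1M", "$1M+")
group <- cut(claims, breaks = breaks, labels = labels, include.lowest = TRUE)

# (a) proportion of claims and (b) proportion of claim value per group.
share_count <- as.numeric(table(group)) / length(claims)
share_value <- as.numeric(tapply(claims, group, sum)) / sum(claims)
share_value[is.na(share_value)] <- 0  # an empty group contributes no value

# One row per measure, one column per group.
shares <- rbind(count = share_count, value = share_value)
colnames(shares) <- labels

# dotchart() on a matrix draws one block per column (group).
dotchart(shares, xlab = "Proportion",
         main = "Share of claims vs. share of claim value")
```

With skewed data of this kind, the count dots and the value dots should sit far apart in the largest groups, which is the contrast a plain histogram hides.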
Two things to try:
Subset to claims below $100k before calling hist. This plots a histogram of all claims of less than $100k, omitting the tail in the interest of showing the bulk of the data. Alternatively,
log-transform your claim size before calling hist, which effectively brings the long tail back in.
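The answer's original snippets are not preserved here, so the following is only a sketch of what the two options might look like, run on simulated data (the `claims` vector is an assumption standing in for `test$adj_unl_claim`):

```r
# Simulated stand-in for the claims data; the distribution is an assumption.
set.seed(1)
claims <- rlnorm(100000, meanlog = 10, sdlog = 1.5)

# Option 1: drop the tail and plot only claims under $100k.
h_sub <- hist(claims[claims < 100000],
              main = "Claims under $100k", xlab = "Claim size ($)")

# Option 2: plot on a log scale so the long tail fits on the axis.
# The log is only defined for positive values, so anything <= 0 is dropped.
h_log <- hist(log10(claims[claims > 0]),
              main = "Log-transformed claims", xlab = "log10(claim size)")
```

Base 10 is used here only so the axis reads in powers of ten; the natural log gives the same shape.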
Thanks, subsetting my data did the trick. I also added two lines of code that calculate the proportion of observations in each histogram bin and then plot them out with specific y and x subsets.
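The asker's two lines are not shown, so this is a guess at the general shape rather than their code. It assumes simulated `claims` data and reads "specific y and x subsets" as explicit axis limits:

```r
# Simulated stand-in for the claims data; the distribution is an assumption.
set.seed(1)
claims <- rlnorm(100000, meanlog = 10, sdlog = 1.5)
small <- claims[claims <= 100000]

# Bin the data without plotting, then convert counts to proportions.
h <- hist(small, breaks = 10, plot = FALSE)
prop <- h$counts / sum(h$counts)

# Plot the proportion in each bin at the bin midpoint, with explicit limits.
plot(h$mids, prop, type = "h", lwd = 8, lend = "butt", col = "red",
     xlim = c(0, 100000), ylim = c(0, max(prop)),
     xlab = "Claim size ($)", ylab = "Proportion of claims")
```

Unlike densities, these proportions sum to 1 across bins, so they sit on a 0 to 1 scale.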