R histogram results in an empty plot

Posted on 2024-10-26 04:49:31

I'm a beginner R programmer attempting to plot a histogram of an insurance claims dataset with 100,000+ observations which is heavily skewed (mean=$61,000, median=$20,000, max value=$15M).

I've submitted the following code to graph the adj_unl_claim variable over the $0-$100,000 domain:

hist(test$adj_unl_claim, freq=FALSE, ylim=c(0,1), xlim=c(0,100000), 
     prob=TRUE, breaks=10, col='red')

with the result being an empty graph with axes but no histogram bars - just an empty graph.

I suspect the problem is related to the skewed nature of my data, but I've tried every combination of breaks and xlim and nothing works. Any solutions are much appreciated!

远山浅 2024-11-02 04:49:31

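If you set freq = FALSE, you get a histogram of probability densities. These can be much smaller than 1, so with ylim = c(0, 1) your histogram bars may be drawn so short that they are indistinguishable from the x-axis. Try again without setting ylim, and R will automatically calculate reasonable limits for the y-axis.

Note also that setting xlim does not change the actual plot, only how much of it you see. So you might not actually see 10 breaks if some of them fall beyond the 100000 limit. You may want to subset your data first to exclude values over 100000, and then draw the histogram on the reduced data set to get the plot you want. Maybe; I'm not quite sure what your objective is here.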
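The two points above can be checked with a minimal sketch. It assumes simulated data (the `claim` vector and the seed are illustrative stand-ins for `test$adj_unl_claim`, not the asker's data) and inspects the bar heights before drawing anything.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for test$adj_unl_claim: skewed, non-negative values
claim <- rexp(100000, rate = 1/400)^1.76

# Compute the histogram without drawing it, to inspect the bar heights
h <- hist(claim, breaks = 10, plot = FALSE)
max_density <- max(h$density)
print(max_density)  # tiny compared with 1, so ylim = c(0, 1) flattens the bars

# Leaving ylim unset lets R pick y-axis limits that make the bars visible
hist(claim, freq = FALSE, breaks = 10, col = "red")

# Subsetting first means the breaks are computed on the range of interest
small <- claim[claim <= 100000]
hist(small, freq = FALSE, breaks = 10, col = "red")
```

Because the densities must integrate to 1 over a range of millions of dollars, each bar's height is necessarily a very small number.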

旧故 2024-11-02 04:49:31

This might give you something to play with, using some of Tyler's suggestions.

> claim <- c(15000000, rexp(99999, rate = 1/400)^1.76) 
> summary(claim)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0     4261    20080    61730    67790 15000000 
> 
> hs    <- 100000     # highest value to show on histogram
> br    <- 10         # number of bars to show on histogram
> 
> hist(claim, xlim = c(0,hs), freq = FALSE, breaks = br*max(claim)/hs, col='red')
> 
> length(claim[claim<hs]) / length(claim) #proportion of claims shown
[1] 0.82267
> sum(claim[claim<hs])    / sum(claim)    #proportion of value shown
[1] 0.3057994

where hist produced something like this:

[Figure: Claim histogram]

The problem with this is that although the histogram covers about 82% of the claims in this pseudo-data, it only covers about 31% of the value of the claims. So unless the only point you want to make is that most claims are small, you might want to consider a different graph.

My guess is that the real point from your data is that while most claims are fairly small, most of the cost is in the big claims. The big claims will not show up in a histogram, even if you extend the scale. Instead break the claims up into groups of differing widths, including for example 0-$1000 and $1M+, and show with a dot plot (a) what proportion of claims fall into each group and (b) what proportion of the values of claims fall into each group.
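A rough sketch of that grouped dot plot, using the same pseudo-data, might look like the following. The group boundaries, labels, and variable names here are illustrative assumptions, not part of the original answer.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)

# Illustrative group boundaries of differing widths (an assumption)
edges  <- c(0, 1000, 10000, 100000, 1000000, Inf)
labels <- c("$0-1K", "$1K-10K", "$10K-100K", "$100K-1M", "$1M+")
group  <- cut(claim, breaks = edges, labels = labels,
              right = FALSE, include.lowest = TRUE)

# (a) proportion of claims and (b) proportion of claim value in each group
prop_count <- as.numeric(table(group)) / length(claim)
prop_value <- vapply(split(claim, group), sum, numeric(1)) / sum(claim)

# One panel per claim-size group, with one dot for each of the two shares
props <- rbind("share of claims" = prop_count,
               "share of value"  = prop_value)
colnames(props) <- labels
dotchart(props, xlim = c(0, 1), pch = 19, xlab = "Proportion")
```

Plotting the two shares side by side for each group makes the contrast between the number of claims and their cost visible in a way a single histogram cannot.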

遗忘曾经 2024-11-02 04:49:31

Two things to try:

hist(test$adj_unl_claim[test$adj_unl_claim < 100000])

will plot a histogram of all claims of less than $100k. This omits the tail in the interest of showing the bulk of the data. Alternatively,

hist(log(test$adj_unl_claim))

will log-transform your claim size, effectively bringing the long tail back in.
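One caveat with the log approach, shown here as a hedged sketch on simulated data (the `claim` vector is a hypothetical stand-in for `test$adj_unl_claim`): a claim of exactly zero has a log of `-Inf`, so it may be worth keeping only positive values explicitly. Using `log10` instead of `log` is an optional choice that makes the axis read as powers of ten.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
claim <- c(15000000, rexp(99999, rate = 1/400)^1.76)

# Keep strictly positive claims, since log10(0) is -Inf
positive  <- claim[claim > 0]
log_claim <- log10(positive)

# On this scale a value of 4 means $10,000 and a value of 6 means $1,000,000
hist(log_claim, breaks = 30, col = "red",
     xlab = "log10(claim size in $)", main = "Claims on a log scale")
```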

許願樹丅啲祈禱 2024-11-02 04:49:31

Thanks, subsetting my data did the trick. I also added two lines of code that calculate the proportion of observations in each histogram bin and then plot them with specific y and x limits:

k <- hist(gb2_agg$adj_unl_claim, prob=TRUE, breaks=100000)
k$counts <- k$counts/sum(k$counts)   # convert bin counts to proportions
plot(k, ylim=c(0,.02), xlim=c(0,50000), col='blue')

About the author: 温柔女人霸气范
