In ggplot2, what do the ends of the boxplot lines represent?

Posted 2024-10-16 12:37:37

I can't find a description of what the end points of the lines of a boxplot represent.

For example, here are point values above and below where the lines end.
[Figure: boxplot with points above and below the ends of the lines]

(I realize that the bottom and top of the box are the 25th and 75th percentiles, and the center line is the 50th.) Since there are points above and below the lines, I assume the line ends do not represent the max/min values.

Comments (4)

病毒体 2024-10-23 12:37:37

The "dots" at the end of the boxplot represent outliers. There are a number of different rules for determining if a point is an outlier, but the method that R and ggplot use is the "1.5 rule". If a data point is:

  • less than Q1 - 1.5*IQR
  • greater than Q3 + 1.5*IQR

then that point is classed as an "outlier". The whiskers are defined as:

upper whisker = min(max(x), Q3 + 1.5 * IQR)

lower whisker = max(min(x), Q1 - 1.5 * IQR)

where IQR = Q3 - Q1, the box length. So the upper whisker is located at the smaller of the maximum x value and Q3 + 1.5 * IQR,
whereas the lower whisker is located at the larger of the smallest x value and Q1 - 1.5 * IQR.
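
The 1.5 rule can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the vector x is made-up data (not from the question), and the quartiles come from quantile() with its default method, which may differ slightly from the hinges that boxplot() uses. The sketch also reports the most extreme data points that still lie inside the limits, since that is where the whisker is actually drawn when no data point sits exactly on the limit.

```r
# Illustrative data (an assumption, not from the original post):
# nine regular values and one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles via quantile() with its default method (type 7)
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Limits ("fences") of the 1.5 rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are classed as outliers
outliers <- x[x < lower_fence | x > upper_fence]

# Most extreme data points that are still inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whisker_ends <- range(inside)

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
print(whisker_ends)
```

For this data the upper limit is 14.5, the only outlier is 30, and the largest value inside the limits is 9.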

Additional information

  • See the Wikipedia boxplot page for alternative outlier rules.
  • There are actually a variety of ways of calculating quantiles. Have a look at `?quantile` for the description of the nine different methods.
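
To get a feel for how much the choice of quantile method matters, the following sketch computes the lower quartile of a small made-up vector (an assumption, not data from the question) with each of the nine type values accepted by quantile(), and compares it with the lower hinge returned by fivenum().

```r
# Illustrative data (an assumption, not from the original post)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Lower quartile under each of the nine methods described in ?quantile
q1_by_type <- sapply(1:9, function(t) quantile(x, 0.25, type = t, names = FALSE))
names(q1_by_type) <- paste0("type", 1:9)
print(q1_by_type)

# Lower hinge from Tukey's five-number summary, for comparison
lower_hinge <- fivenum(x)[2]
print(lower_hinge)
```

With such a small sample the methods disagree visibly; for example, the default (type 7) gives 3.25 while the lower hinge is 3. The differences usually shrink as the sample grows.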

Example

Consider the following example

> set.seed(1)
> x = rlnorm(20, 1/2)#skewed data
> par(mfrow=c(1,3))
> boxplot(x, range=1.7, main="range=1.7")
> boxplot(x, range=1.5, main="range=1.5")#default
> boxplot(x, range=0, main="range=0")#The same as range="Very big number"

This gives the following plot:
[Figure: three boxplots of x side by side, titled "range=1.7", "range=1.5" and "range=0"]

As we decrease range from 1.7 to 1.5 we reduce the length of the whisker. However, range=0 is a special case: the whiskers extend to the data extremes, so it is equivalent to "range=infinity".
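
The same effect can be checked numerically without plotting. The sketch below assumes that the range argument of boxplot() corresponds to the coef argument of boxplot.stats(), which is how the help pages describe it; the vector x is made-up data, not the sample used above.

```r
# Illustrative data (an assumption, not from the original post)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Default rule: whiskers stop at the most extreme point within 1.5 box lengths
default_stats <- boxplot.stats(x, coef = 1.5)

# coef = 0: whiskers run to the data extremes and no outliers are reported
zero_stats <- boxplot.stats(x, coef = 0)

# $stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
print(default_stats$stats)
print(default_stats$out)
print(zero_stats$stats)
print(zero_stats$out)
```

With coef = 1.5 the upper whisker ends at 9 and the value 30 is listed as an outlier; with coef = 0 the upper whisker ends at 30 and the outlier vector is empty.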

誰ツ都不明白 2024-10-23 12:37:37

I think ggplot uses the standard defaults, the same as boxplot: "the whiskers extend to the most extreme data point which is no more than [1.5] times the length of the box away from the box"

See: boxplot.stats

皇甫轩 2024-10-23 12:37:37

P1IMSA Tutorial 8 - Understanding Box and Whisker Plots video offers a visual step-by-step explanation of (Tukey) box and whisker plots.

At 4m 23s I explain the meaning of the whisker ends and its relationship to the 1.5*IQR.

Although the chart shown in the video was rendered using D3.js rather than R, its explanations jibe with the R implementations of boxplots mentioned.

烟沫凡尘 2024-10-23 12:37:37

As highlighted by @TemplateRex in a comment, ggplot doesn't draw the whiskers at the upper/lower quartile plus/minus 1.5 times the IQR. It actually draws them at max(x[x < Q3 + 1.5 * IQR]) and min(x[x > Q1 - 1.5 * IQR]). For example, here is a plot drawn using geom_boxplot where I've added a dashed line at the value Q1 - 1.5*IQR:

[Figure: geom_boxplot output with a dashed line at Q1 - 1.5*IQR]

Q1 = 52

Q3 = 65

Q1 - 1.5 * IQR = 52 - 13*1.5 = 32.5 (dashed line)

Lower whisker = min(x[x > Q1 - 1.5 * IQR]) = 35 (where x is the data used to create the boxplot, outlier is at x = 27).

MWE
Note that this isn't the exact code I used to produce the image above, but it gets the point across.

library("ggplot2")
library("mosaic") # For favstats()

# Raw values, kept as a plain vector so the summary statistics are easy to compute
x <- c(54, 41, 55, 66, 71, 50, 65, 54, 72, 46, 36, 64, 49, 64, 73,
       52, 53, 66, 49, 64, 44, 56, 49, 54, 61, 55, 52, 64, 60, 54, 59,
       67, 58, 51, 63, 55, 67, 68, 54, 53, 58, 26, 53, 56, 61, 51, 51,
       50, 51, 68, 60, 67, 66, 51, 60, 52, 79, 62, 55, 74, 62, 59, 35,
       67, 58, 74, 48, 53, 40, 62, 67, 57, 68, 56, 75, 55, 41, 50, 73,
       57, 62, 61, 48, 60, 64, 53, 53, 66, 58, 51, 68, 69, 69, 58, 54,
       57, 65, 78, 70, 52, 59, 52, 65, 70, 53, 57, 72, 47, 50, 70, 41,
       64, 59, 58, 65, 57, 60, 70, 46, 40, 76, 60, 64, 51, 38, 67, 57,
       64, 51)
df <- data.frame(x = x)

# Quartiles and interquartile range (named iqr so stats::IQR is not masked)
stats <- favstats(x)
Q1 <- stats$Q1
Q3 <- stats$Q3
iqr <- Q3 - Q1

# Limits at 1.5 * IQR beyond the quartiles (the dashed lines)
lowerlim <- Q1 - 1.5 * iqr
upperlim <- Q3 + 1.5 * iqr

# Whisker ends: the most extreme data points inside those limits
boxplot_Tukey_lower <- min(x[x > lowerlim])
boxplot_Tukey_upper <- max(x[x < upperlim])

ggplot(df, aes(x = "", y = x)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  geom_hline(yintercept = lowerlim, linetype = "dashed") +
  geom_hline(yintercept = upperlim, linetype = "dashed")
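
The whisker positions that ggplot actually uses can also be read back from the plot object rather than computed by hand. The sketch below is a hedged illustration, not the code behind the image: it uses a small made-up vector and assumes that layer_data() returns the ymin and ymax columns for a geom_boxplot layer, which is the case in recent ggplot2 versions.

```r
library(ggplot2)

# Illustrative data (an assumption, not the data from this answer)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

p <- ggplot(data.frame(x = x), aes(x = "", y = x)) +
  geom_boxplot()

# Statistics computed by ggplot for the boxplot layer
box <- layer_data(p)

# Upper limit of the 1.5 rule, using quantile()'s default method
q <- quantile(x, c(0.25, 0.75), names = FALSE)
upper_limit <- q[2] + 1.5 * (q[2] - q[1])

print(upper_limit)               # where Q3 + 1.5 * IQR falls
print(box$ymax)                  # where ggplot ends the upper whisker
print(max(x[x <= upper_limit]))  # largest data point inside the limit
```

Here the limit is 14.5, but the whisker ends at 9, the largest data point that does not exceed the limit.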
