Ignore outliers in ggplot2 boxplot

Posted on 2024-11-01 18:24:49

How would I ignore outliers in a ggplot2 boxplot? I don't simply want them to disappear (i.e. outlier.size=0); I want them to be ignored so that the y axis scales to show the 1st and 3rd quartiles. My outliers are causing the "box" to shrink so small it's practically a line. Are there techniques to deal with this?

Edit
Here's an example:

library(ggplot2)
y = c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
qplot(1, y, geom="boxplot")  # note: qplot() is deprecated as of ggplot2 3.4.0


Comments (9)

醉殇 2024-11-08 18:24:49

Use geom_boxplot(outlier.shape = NA) to not display the outliers and scale_y_continuous(limits = c(lower, upper)) to change the axis limits.

An example.

library(ggplot2)

n <- 1e4L
dfr <- data.frame(
  y = exp(rlnorm(n)),  #really right-skewed variable
  f = gl(2, n / 2)
)

p <- ggplot(dfr, aes(f, y)) + 
  geom_boxplot()
p   # big outlier causes quartiles to look too slim

p2 <- ggplot(dfr, aes(f, y)) + 
  geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = quantile(dfr$y, c(0.1, 0.9)))
p2  # no outliers plotted, range shifted

Actually, as Ramnath showed in his answer (and Andrie too in the comments), it makes more sense to crop the scales after you calculate the statistic, via coord_cartesian.

coord_cartesian(ylim = quantile(dfr$y, c(0.1, 0.9)))

(You'll probably still need to use scale_y_continuous to fix the axis breaks.)
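
The difference between the two approaches can be checked directly. The sketch below reuses the `dfr` example and compares the box statistics ggplot2 computes in each case, using `layer_data()`. The fixed seed and the names `p_full`, `p_limits` and `p_zoom` are illustrative assumptions, not part of the original answer.

```r
library(ggplot2)

set.seed(1)  # fixed seed, only so the sketch is reproducible
n <- 1e4L
dfr <- data.frame(
  y = exp(rlnorm(n)),  # really right-skewed variable
  f = gl(2, n / 2)
)
lims <- quantile(dfr$y, c(0.1, 0.9))

p_full   <- ggplot(dfr, aes(f, y)) + geom_boxplot(outlier.shape = NA)
p_limits <- p_full + scale_y_continuous(limits = lims)  # drops data outside lims before the stat runs
p_zoom   <- p_full + coord_cartesian(ylim = lims)       # keeps all data, only zooms the view

# Box statistics computed by stat_boxplot for each plot, one row per group
cols <- c("lower", "middle", "upper")
stats_full   <- layer_data(p_full)[, cols]
stats_limits <- suppressWarnings(layer_data(p_limits))[, cols]  # warns about removed rows
stats_zoom   <- layer_data(p_zoom)[, cols]
```

Under these assumptions, `stats_zoom` should match `stats_full`, while `stats_limits` describes a narrower box because the quartiles were recomputed on the truncated data.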

阳光①夏 2024-11-08 18:24:49

Here is a solution using boxplot.stats

# create a dummy data frame with outliers
df = data.frame(y = c(-100, rnorm(100), 100))

# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))


# compute lower and upper whiskers
ylim1 = boxplot.stats(df$y)$stats[c(1, 5)]

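# scale y limits based on ylim1, padded by 5% of the whisker range on each side
# (multiplying by 1.05 only pads correctly when the limits straddle zero)
p1 = p0 + coord_cartesian(ylim = ylim1 + c(-0.05, 0.05) * diff(ylim1))

难得心□动 2024-11-08 18:24:49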

I had the same problem and precomputed the values for Q1, median, Q3, ymin and ymax using boxplot.stats:

# Load package and generate data
library(ggplot2)
data <- rnorm(100)

# Compute boxplot statistics
stats <- boxplot.stats(data)$stats
df <- data.frame(x="label1", ymin=stats[1], lower=stats[2], middle=stats[3], 
                 upper=stats[4], ymax=stats[5])

# Create plot
p <- ggplot(df, aes(x=x, lower=lower, upper=upper, middle=middle, ymin=ymin, 
                    ymax=ymax)) + 
    geom_boxplot(stat="identity")
p

The result is a boxplot without outliers.
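
The same idea could be extended to several groups by computing one row of statistics per group. This is a minimal sketch under assumed data: the data frame `raw`, its columns `group` and `value`, and the name `stats_by_group` are hypothetical and do not come from the answer.

```r
library(ggplot2)

# Hypothetical example data: two groups, one containing an extreme value
set.seed(42)
raw <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(49), 50, rnorm(50, mean = 2))
)

# One row of five-number statistics per group, taken from boxplot.stats
stats_by_group <- do.call(rbind, lapply(split(raw$value, raw$group), function(v) {
  s <- boxplot.stats(v)$stats
  data.frame(ymin = s[1], lower = s[2], middle = s[3], upper = s[4], ymax = s[5])
}))
stats_by_group$group <- rownames(stats_by_group)

# Draw the boxes from the precomputed values; outliers are never passed to ggplot2
p <- ggplot(stats_by_group,
            aes(x = group, ymin = ymin, lower = lower, middle = middle,
                upper = upper, ymax = ymax)) +
  geom_boxplot(stat = "identity")
p
```

Because only the five summary values reach the plot, the y axis is scaled to the whiskers of the groups shown.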

负佳期 2024-11-08 18:24:49

One idea would be to winsorize the data in a two-pass procedure:

  1. run a first pass, learn what the bounds are, e.g. cut off at a given percentile, or N standard deviations above the mean, or ...

  2. in a second pass, set the values beyond the given bound to the value of that bound

I should stress that this is an old-fashioned method which ought to be dominated by more modern robust techniques but you still come across it a lot.
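
The two passes can be sketched in a few lines of base R. The helper `winsorize` and its `probs` argument are hypothetical names chosen for illustration, and the 10th/90th percentile bounds are an arbitrary assumption. Note that this changes the data values themselves, so the boxplot then summarises the clamped data.

```r
library(ggplot2)

# Hypothetical helper: clamp values outside the given percentiles to those percentiles
winsorize <- function(x, probs = c(0.1, 0.9)) {
  bounds <- quantile(x, probs, na.rm = TRUE, names = FALSE)  # pass 1: learn the bounds
  pmin(pmax(x, bounds[1]), bounds[2])                        # pass 2: clamp to the bounds
}

# Data from the question
y <- c(.01, .02, .03, .04, .05, .06, .07, .08, .09, .5, -.6)
y_w <- winsorize(y, probs = c(0.1, 0.9))

p <- ggplot(data.frame(y = y_w), aes(x = factor(1), y = y)) + geom_boxplot()
p
```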

诗酒趁年少 2024-11-08 18:24:49

gg.layers::geom_boxplot2 is just what you want.

# remotes::install_github('rpkgs/gg.layers')
library(gg.layers)
library(ggplot2)
p <- ggplot(mpg, aes(class, hwy))
p + geom_boxplot2(width = 0.8, width.errorbar = 0.5)

https://rpkgs.github.io/gg.layers/reference/geom_boxplot2.html

灼疼热情 2024-11-08 18:24:49

If you want to force the whiskers to extend to the max and min values, you can tweak the coef argument. The default value for coef is 1.5 (i.e. by default a whisker reaches at most 1.5 times the IQR beyond the box).

# Load package and create a dummy data frame with outliers 
#(using example from Ramnath's answer above)
library(ggplot2)
df = data.frame(y = c(-100, rnorm(100), 100))

# create boxplot that includes outliers
p0 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)))

# create boxplot where whiskers extend to max and min values
p1 = ggplot(df, aes(y = y)) + geom_boxplot(aes(x = factor(1)), coef = 500)


岁月流歌 2024-11-08 18:24:49

New in ggplot2 3.5.0 is the option outliers = FALSE. It will not count outliers towards calculating the range of the y-axis.

library(ggplot2)

p <- ggplot(mpg, aes(class, displ))

# Ignoring outliers
p + geom_boxplot(outliers = FALSE)

You can still keep the data range on the y-axis if you use outlier.shape = NA or outlier.alpha = 0.

# Hiding outliers
p + geom_boxplot(outlier.shape = NA)

Created on 2024-02-27 with reprex v2.1.0
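
For code that may run on older ggplot2 versions, one possible approach is to check the installed version and fall back to zooming on the whisker range. The helper name `add_boxplot_ignoring_outliers` is hypothetical and not part of ggplot2; the fallback is a sketch that assumes the default cartesian coordinates.

```r
library(ggplot2)

# Hypothetical helper, not part of ggplot2: add a boxplot whose y axis ignores outliers
add_boxplot_ignoring_outliers <- function(p) {
  if (packageVersion("ggplot2") >= "3.5.0") {
    return(p + geom_boxplot(outliers = FALSE))
  }
  # Older versions: hide the points, then zoom to the overall whisker range
  hidden <- p + geom_boxplot(outlier.shape = NA)
  whiskers <- layer_data(hidden)  # one row per box, with ymin/ymax as whisker ends
  hidden + coord_cartesian(ylim = range(whiskers$ymin, whiskers$ymax))
}

p <- ggplot(mpg, aes(class, displ))
add_boxplot_ignoring_outliers(p)
```

The fallback zooms to the widest whisker range across all boxes, so it should give a comparable view without changing the computed statistics.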

流年已逝 2024-11-08 18:24:49

Simple, dirty and effective.
geom_boxplot(outlier.alpha = 0)
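
Whether this option also rescales the axis can be checked by comparing the y limits that ggplot2 trains for each plot. The sketch below uses assumed example data (the variable names are illustrative) and `layer_scales()` to read the limits.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(y = c(-100, rnorm(100), 100))
base <- ggplot(df, aes(x = factor(1), y = y))

# y-scale limits trained for the default plot and for the plot with transparent outliers
limits_default <- layer_scales(base + geom_boxplot())$y$get_limits()
limits_hidden  <- layer_scales(base + geom_boxplot(outlier.alpha = 0))$y$get_limits()
```

If the two sets of limits are equal, the outliers are invisible but still stretch the axis, so this option alone would not enlarge the box.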

活泼老夫 2024-11-08 18:24:49

The "coef" option of the geom_boxplot function lets you change the outlier cutoff, expressed in multiples of the interquartile range. This option is documented under stat_boxplot. To deactivate outliers (in other words, to treat them as regular data), specify a very high cutoff value instead of the default of 1.5:

library(ggplot2)
# generate data with outliers:
df = data.frame(x=1, y = c(-10, rnorm(100), 10)) 
# generate plot with increased cutoff for outliers:
ggplot(df, aes(x, y)) + geom_boxplot(coef=1e30)
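
How the cutoff depends on coef can be illustrated with base R's `boxplot.stats`, which takes the same kind of `coef` argument. This is only a sketch: `boxplot.stats` works from hinges, so its numbers may differ slightly from the quantiles ggplot2 uses, and the variable names are assumptions.

```r
set.seed(1)
y <- c(-10, rnorm(100), 10)

# The cutoffs ("fences") sit coef * IQR beyond the hinges of the box
hinges <- boxplot.stats(y)$stats[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)

out_default <- boxplot.stats(y, coef = 1.5)$out   # points beyond the default fences
out_huge    <- boxplot.stats(y, coef = 1e30)$out  # no point lies beyond such a distant fence
```

With the very large coef, no value is classified as an outlier, so the whiskers run to the minimum and maximum of the data.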