Outlier shapes by sample ID in geom_boxplot

Posted on 2025-01-16 18:43:09

How can I modify the shape of the outliers in geom_boxplot so that it matches the sample ID over time?
Imagine I have this kind of data (this is just dummy data; the code might not be pretty, but that's what I came up with):

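# load the packages used below (%>%, group_by, mutate, case_when, ggplot)
library(dplyr)
library(ggplot2)

# create dummy data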
df <- data.frame()
set.seed(42)
os <- 0
sam <- 1
for (time in as.factor(c('T0', 'T1'))) {
  if (time == 'T1') {
    sam <- 1
  }
  for (group in as.factor(c('A','B'))) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outlier, they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 | pat == 9) {
        print(pat)
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}

# mark outliers in table
df <- df %>%
  group_by(Group, Time) %>%
  mutate(is_outlier = case_when(Value > quantile(Value)[4] + 1.5 * IQR(Value) ~ TRUE,
                                Value < quantile(Value)[2] - 1.5 * IQR(Value) ~ TRUE,
                                TRUE ~ FALSE))

This results in the following plot:

ggplot(df, aes(x = Time,
               y = Value,
               label = Time)) +
  geom_boxplot(outlier.colour =  'red',
               outlier.shape = 1,
               outlier.size = 2
  ) +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')

Goal:

What I want is to see, for each group (A or B), whether the outliers are the same samples over time. For instance, the outlier drawn as a circle in A T0 should also be a circle in A T1, and the second outlier in A T1 should get a different shape (e.g. a triangle). Since my original data has about five or six time points, it would be nice to see from the plot whether an outlier stays an outlier.
In some cases my original dataset has about 5-8 outliers.

In group B we can reuse the same shapes as in group A, although the sample IDs differ from those in group A.

I want to use basic shapes like triangles, circles, asterisks and so on (I know the number of shapes is limited, but for my kind of dataset it should suffice). I also know that I could label the data points, but I don't want that.
Different colours would be okay too, but I'd prefer different shapes.

I guess I have to calculate outliers separately and then maybe use geom_point with aes(shape = df$Sample) or something. But I can't figure it out.

Does anybody have a hint or a solution based on my dummy data?
That would be awesome :-)

Best, TMC

毁我热情 2025-01-23 18:43:09

I figured out a really ugly solution. I'm pretty sure there is a prettier way to do this but here is the full code:

First we create dummy data:

# start with a clean environment
rm(list=ls())  
# create a function to load or install all necessary libraries 
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  require(x, character.only = TRUE)
}
package_vec <- c("ggplot2",
                 "dplyr"
)
sapply(package_vec, install.load.package)  

# now to the data
df <- data.frame()
set.seed(42)
os <- 0
sam <- 1
for (time in as.factor(c('T0', 'T1'))) {
  if (time == 'T1') {
    sam <- 1
  }
  for (group in as.factor(c('A','B'))) {
    for (pat in 1:10) {
      df[pat + os, 'Sample'] <- paste('P', pat, '_', sam, sep = '')
      df[pat + os, 'Time'] <- time
      df[pat + os, 'Group'] <- group
      df[pat + os, 'Value'] <- rnorm(1) + os
      # add outlier, they are the same in each group in this example,
      # but can differ in the real data set
      if (pat == 2 | pat == 9) {
        print(pat)
        df[pat + os, 'Value'] <- df[pat + os, 'Value'] + 10
      }
      sam <- sam + 1
    }
    os <- os + 10
  }
}

Then we calculate the outliers as follows and create a new column that holds the sample ID of each outlier. If a value is not an outlier, an 'X' is inserted instead.

# calculate outliers
df = df %>% 
  group_by(Group,Time)  %>%
  mutate(is_outlier = case_when(Value > quantile(Value)[4] + 1.5*IQR(Value) ~ as.character(Sample),
                                Value < quantile(Value)[2] - 1.5*IQR(Value) ~ as.character(Sample),
                                TRUE ~ as.character('X')))
df$Group <- as.factor(df$Group)
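
For reference, the rule used above can be written as a small standalone helper. This is only a sketch: `flag_outliers` and its `coef` argument are hypothetical names, not part of dplyr or ggplot2, and it assumes the same 1.5 * IQR fences around the first and third quartiles that the `case_when()` call encodes.

```r
# Hypothetical helper (not from any package): flag values that lie outside
# the 1.5 * IQR fences around the first and third quartiles.
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - coef * iqr | x > q[[2]] + coef * iqr
}

# Toy values: nine unremarkable points and one extreme point.
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
flagged <- flag_outliers(values)
values[flagged]
```

Inside the pipeline it could be used as `mutate(is_outlier = if_else(flag_outliers(Value), as.character(Sample), "X"))`, which keeps the column layout that the next step expects.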

Now we replace the sample ID with a number. Within each group, the first outlier sample (which may appear at several time points) gets the number 1, the second gets 2, and so on. If there are more outliers than available `geom_point` shapes, the code has to be adapted. R's numeric point shapes run from 0 to 25, so numbering from 1 leaves at most 25 shapes per group, and some of them look alike when drawn without a fill.

for (group in levels(df$Group)) {
  count <- 1
  for (id in levels(as.factor(df$is_outlier[which(df$Group == group)]))) {
    if (id == 'X') {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(NA)
    } else {
      df[which(df$is_outlier == id), 'is_outlier'] <- as.character(count)
      count <- count + 1
    }
  }
}
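
The same renumbering can be sketched without an explicit loop. The example below is an illustration on a hand-made table, not the data above: `toy` and `shape_id` are hypothetical names, and the table is assumed to contain only the rows already flagged as outliers. It numbers the distinct sample IDs within each group using base R's `ave()` and `match()`.

```r
# Hypothetical table holding only the rows that were flagged as outliers.
toy <- data.frame(
  Group  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Time   = c("T0", "T0", "T1", "T1", "T0", "T0", "T1", "T1"),
  Sample = c("P2_2", "P9_9", "P2_2", "P9_9",
             "P2_12", "P9_19", "P2_12", "P9_19"),
  stringsAsFactors = FALSE
)

# Within each group, map every distinct sample ID to 1, 2, 3, ...
# ave() passes the row indices of one group at a time to FUN.
toy$shape_id <- ave(
  seq_along(toy$Sample), toy$Group,
  FUN = function(i) match(toy$Sample[i], sort(unique(toy$Sample[i])))
)

toy
```

The numbering restarts at 1 in every group, so shapes are reused across groups, just as in the loop.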

This overwrites the previously created column and introduces NAs for the 'X' values.

Now we can plot the data:

ggplot(df, aes(x = Time,
               y = Value,
               label = Time)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = df,
             shape= as.numeric(df$is_outlier),
             color = 'red') +
  facet_grid(~factor(Group),
             switch = 'x',
             scales = 'free_y')

This results in this plot:

Now we can see whether an outlier stays an outlier from T0 to T1. Be aware that Group B reuses the same shapes as Group A, although these are completely different samples. To give every sample its own shape across groups, the renumbering code above the plotting code has to be adapted, but then potentially fewer shapes are left per group.
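
One possible tidier variant, offered as a sketch rather than a definitive answer, maps the shape inside `aes()` and sets the shapes with `scale_shape_manual()` instead of passing a numeric vector to `shape =`. The data set `toy`, the helper `is_out` and the column `shape_id` are hypothetical and deterministic so the result is easy to check; the shape values 1, 2 and 8 (circle, triangle, asterisk) are an arbitrary choice.

```r
library(ggplot2)

# Deterministic toy data (hypothetical): 8 samples x 2 time points x 2 groups.
toy <- expand.grid(Pat = 1:8, Time = c("T0", "T1"), Group = c("A", "B"),
                   stringsAsFactors = FALSE)
toy$Value <- as.numeric(toy$Pat)
toy$Value[toy$Pat == 2] <- 30                      # outlier at T0 and T1
toy$Value[toy$Pat == 7 & toy$Time == "T1"] <- 40   # outlier at T1 only

# Flag outliers per Group/Time cell with the 1.5 * IQR rule.
is_out <- function(v) {
  q <- quantile(v, c(0.25, 0.75))
  iqr <- q[[2]] - q[[1]]
  v < q[[1]] - 1.5 * iqr | v > q[[2]] + 1.5 * iqr
}
toy$is_outlier <- ave(toy$Value, toy$Group, toy$Time, FUN = is_out) == 1

# Keep only the outlier rows and number the distinct samples within each group.
out <- toy[toy$is_outlier, ]
out$shape_id <- factor(ave(out$Pat, out$Group,
                           FUN = function(p) match(p, sort(unique(p)))))

# Boxplots without their default outliers, plus one point layer for outliers.
p <- ggplot(toy, aes(x = Time, y = Value)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data = out, aes(shape = shape_id), colour = "red", size = 2) +
  scale_shape_manual(values = c(1, 2, 8), name = "Outlier") +
  facet_grid(~ Group, switch = "x")
```

Printing `p` draws the plot. Because only the outlier rows are passed to `geom_point()`, no NA shapes are involved, and the legend shows which shape belongs to which outlier number. Mapping `shape` to the sample ID itself instead of `shape_id` would give every sample its own shape across groups, at the cost of needing more entries in `values`.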

If one of you has a smoother and more elegant solution, I'd be happy to learn.

Best, TMC
