如何轻松生成/模拟不同组的示例数据以进行建模

发布于 2025-01-12 10:07:14 字数 1243 浏览 1 评论 0原文

如何轻松生成/模拟有意义的建模示例数据：例如，告诉我给我 n 行数据，对于 2 个组，他们的性别分布和平均年龄应分别相差 X 和 Y 单位？有没有一种简单的方法可以自动完成？有包吗？

例如，生成此类数据的最简单方法是什么？

组：两组：A、B
性别： 不同性别分布：A 30%，B 70%
年龄： 不同平均年龄：A 50 ，B 70

PS！ Tidyverse 解决方案特别受欢迎。

到目前为止，我最好的尝试仍然是相当多的代码：

n=100
d = bind_rows(
  #group A females
  tibble(group = rep("A"),
         sex = rep("Female"),
         age = rnorm(n*0.4, 50, 4)),
  #group B females
  tibble(group = rep("B"),
         sex = rep("Female"),
         age = rnorm(n*0.3, 45, 4)),
  #group A males
  tibble(group = rep("A"),
         sex = rep("Male"),
         age = rnorm(n*0.20, 60, 6)),
  #group B males
  tibble(group = rep("B"),
         sex = rep("Male"),
         age = rnorm(n*0.10, 55, 4)))

< img src="https://i.sstatic.net/NA4gR.png" alt="在此处输入图像描述">

d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))

原文

How to easily generate/simulate meaningful example data for modelling: e.g. telling that give me n rows of data, for 2 groups, their sex distributions and mean age should differ by X and Y units, respectively? Is there a simple way for doing it automatically? Any packages?

For example, what would be the simplest way for generating such data?

groups: two groups: A, B
sex: different sex distributions: A 30%, B 70%
age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

n=100
d = bind_rows(
  #group A females
  tibble(group = rep("A"),
         sex = rep("Female"),
         age = rnorm(n*0.4, 50, 4)),
  #group B females
  tibble(group = rep("B"),
         sex = rep("Female"),
         age = rnorm(n*0.3, 45, 4)),
  #group A males
  tibble(group = rep("A"),
         sex = rep("Male"),
         age = rnorm(n*0.20, 60, 6)),
  #group B males
  tibble(group = rep("B"),
         sex = rep("Male"),
         age = rnorm(n*0.10, 55, 4)))

d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

半夏半凉 2025-01-19 10:07:14

在 R 中，有很多方法可以从向量中进行采样并从随机分布中进行绘制。例如，您请求的数据集可以这样创建：

set.seed(69) # Makes samples reproducible

df <- data.frame(groups = rep(c("A", "B"), each = 100),
                 sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                         sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                 age = c(runif(100, 25, 75), runif(100, 50, 90)))

我们可以使用 tidyverse 来显示它执行了预期的操作：

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50

There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:

set.seed(69) # Makes samples reproducible

df <- data.frame(groups = rep(c("A", "B"), each = 100),
                 sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
                         sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
                 age = c(runif(100, 25, 75), runif(100, 50, 90)))

And we can use the tidyverse to show it does what was expected:

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50

回复收藏 0 原文

~没有更多了~