How to easily generate/simulate example data with different groups for modelling

Posted on 2025-01-12 10:07:14

How can I easily generate or simulate meaningful example data for modelling? For example, I would like to ask for n rows of data in 2 groups whose sex distributions and mean ages differ by X and Y units, respectively. Is there a simple way to do this automatically? Are there any packages for it?

For example, what would be the simplest way to generate data like this?

  • groups: two groups: A, B
  • sex: different sex distributions: A 30%, B 70%
  • age: different mean ages: A 50, B 70

PS! Tidyverse solutions are especially welcome.

My best try so far is still quite a lot of code:

library(dplyr) # provides bind_rows(), tibble() and %>%

n <- 100
d <- bind_rows(
  # group A females
  tibble(group = "A",
         sex = "Female",
         age = rnorm(n * 0.4, 50, 4)),
  # group B females
  tibble(group = "B",
         sex = "Female",
         age = rnorm(n * 0.3, 45, 4)),
  # group A males
  tibble(group = "A",
         sex = "Male",
         age = rnorm(n * 0.2, 60, 6)),
  # group B males
  tibble(group = "B",
         sex = "Male",
         age = rnorm(n * 0.1, 55, 4)))

d %>% group_by(group, sex) %>% 
  summarise(n = n(),
            mean_age = mean(age))
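
The four blocks above differ only in their parameters, so the same design can be written as a small parameter table that is expanded into rows. This is a minimal sketch rather than an established recipe: the table `cells` and its column names (`n`, `mean_age`, `sd_age`) are made up for illustration, the values are copied from the code above, and the seed is arbitrary.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# One row per group-by-sex cell (hypothetical layout; values mirror the code above)
cells <- tribble(
  ~group, ~sex,     ~n, ~mean_age, ~sd_age,
  "A",    "Female", 40, 50,        4,
  "B",    "Female", 30, 45,        4,
  "A",    "Male",   20, 60,        6,
  "B",    "Male",   10, 55,        4
)

d <- cells %>%
  uncount(n) %>%                                  # repeat each cell row n times
  mutate(age = rnorm(n(), mean_age, sd_age)) %>%  # rnorm() is vectorised over mean and sd
  select(group, sex, age)
```

Adding a group or changing a cell size then only means editing a row of `cells`; the same `group_by(group, sex)` summary as above can be used to check the result.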

Comments (1)

半夏半凉 2025-01-19 10:07:14

R offers many ways to sample from vectors and to draw from random distributions. For example, a data set like the one you requested could be created with sample() for sex and runif() for age:

set.seed(69) # Makes samples reproducible

df <- data.frame(
  groups = rep(c("A", "B"), each = 100),
  # Sex: P(M) = 0.3 in group A, 0.5 in group B
  sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
          sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
  # Age: uniform draws with expected values 50 (A) and 70 (B)
  age = c(runif(100, 25, 75), runif(100, 50, 90))
)

And we can use the tidyverse to show it does what was expected:

library(dplyr)

df %>% 
  group_by(groups) %>% 
  summarise(age = mean(age),
            # Count of males; equals the percentage only because each group has 100 rows
            percent_male = sum(sex == "M"))
#> # A tibble: 2 x 3
#>   groups   age percent_male
#>   <chr>  <dbl>        <int>
#> 1 A       49.4           29
#> 2 B       71.0           50
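
If this approach is wanted as a reusable function, one possible sketch follows. `simulate_groups()` and its arguments are hypothetical names, not part of any package. It reads the question's "A 30%, B 70%" as the share of males in each group (an assumption), draws ages from a normal distribution instead of the uniform one used above, and uses an arbitrary default standard deviation of 10.

```r
library(tibble)

# Hypothetical helper: p_male and mean_age are named vectors, one entry per group
simulate_groups <- function(n_per_group, p_male, mean_age, sd_age = 10) {
  group <- rep(names(p_male), each = n_per_group)
  tibble(
    group = group,
    # rbinom() is vectorised over prob, so each row uses its own group's share of males
    sex   = ifelse(rbinom(length(group), 1, p_male[group]) == 1, "M", "F"),
    # Normal ages centred on the group mean (assumed distribution and spread)
    age   = rnorm(length(group), mean_age[group], sd_age)
  )
}

set.seed(1)  # arbitrary seed, only for reproducibility
sim <- simulate_groups(
  n_per_group = 1000,
  p_male      = c(A = 0.3, B = 0.7),
  mean_age    = c(A = 50, B = 70)
)
```

Because sex is sampled rather than fixed, the realised proportions and means only approximate the targets, and they get closer as `n_per_group` grows.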