Proportional allocation sampling with the dplyr package in a large data frame

Posted on 2025-02-06 15:29:34

I have a big data set with 10525 rows with 44 different categories:


library(dplyr)  # provides %>%, group_by(), summarise(); also re-exports tibble()

var   = c(rep("a",440), rep("b",255) ,rep("c",333),rep("d",47) ,rep("e",159),rep("f",67) ,rep("g",133),
          rep("h",342), rep("i",131) ,rep("j",606),rep("k",129),rep("l",126),rep("m",155),rep("n",62),
          rep("o",616), rep("p", 173),rep("q",430),rep("r",2)  ,rep("s",453),
          rep("t",154), rep("v",145),rep("u", 307),rep("w",233),rep("x",315),rep("y",65),rep("z",159),
          rep("aa",758),rep("ab",307),rep("ac", 413),rep("ad",184),rep("ae",334),rep("af",111),rep("ag",175),
          rep("ah", 262),rep("ai",309),rep("aj",71),rep("ak",35),rep("al",302),
          rep("am",266), rep("an",36),rep("ao",47),rep("ap",415),rep("aq",204),rep("ar",259))
value = rnorm(10525)
dat = tibble(var,value)

Now I want to do proportional allocation sampling, i.e. from each category (group) I want to draw exactly the number of observations given in the column Ni of the table below.

d1=dat%>%
  group_by(var)%>%
  summarise(N = n())

d2=dat%>%
  group_by(var)%>%
  summarise(w = n()/nrow(.))

A = left_join(d1,d2,by="var")%>%
  mutate(Ni = round(N*w));A
# A tibble: 44 × 4
   var       N      w    Ni
   <chr> <int>  <dbl> <dbl>
 1 a       440 0.0418    18
 2 aa      758 0.0720    55
 3 ab      307 0.0292     9
 4 ac      413 0.0392    16
 5 ad      184 0.0175     3
 6 ae      334 0.0317    11
 7 af      111 0.0105     1
 8 ag      175 0.0166     3
 9 ah      262 0.0249     7
10 ai      309 0.0294     9
# … with 34 more rows

The theoretically correct total sample size must be: 354

sum(A$Ni)
[1] 354
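
For reference, the Ni column above can be written out as a formula. The symbols N_h (size of group h), N (total number of rows), and w_h (group share) are notation introduced here for illustration only; they mirror the columns N and w in the code.

```latex
\begin{align*}
w_h &= \frac{N_h}{N}, \\
N_i &= \operatorname{round}(N_h \, w_h) = \operatorname{round}\!\left(\frac{N_h^{2}}{N}\right). \\
\intertext{For group \texttt{a}, using the rounded weight shown in the table:}
w_a &= \frac{440}{10525} \approx 0.0418, \qquad
N_i = \operatorname{round}(440 \times 0.0418) = \operatorname{round}(18.39) = 18 .
\end{align*}
```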

Two questions:

1) How can I do that in R?

2) How can I add a constraint so that, if the subsample size (Ni) is 0, one observation is taken anyway?

Any help would be appreciated.

My attempt:

a1 = dat %>% 
  left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
  slice_sample(n = sum(A$Ni), weight_by = w)%>%
  select(c(var,Ni))%>%
  group_by(var)%>%
  summarise(n());a1
Joining, by = "var"
# A tibble: 41 × 2
   var   `n()`
   <chr> <int>
 1 a        17
 2 aa       21
 3 ab       11
 4 ac       11
 5 ad        7
 6 ae       10
 7 af        4
 8 ag        3
 9 ah       11
10 ai        4
# … with 31 more rows

but the result should contain all 44 groups.
Alternatively:

A = left_join(d1,d2,by="var")%>%
  mutate(Ni = round(N*w))%>%
  mutate(x = replace(Ni,Ni==0,1));A
sum(A$x)
print(tibble(A),n=44)

a1 = dat %>% 
  left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
  slice_sample(n = sum(A$x), weight_by = w)%>%
  select(c(var,Ni))%>%
  group_by(var)%>%
  summarise(n())
print(tibble(a1),n=44)

but again this does not sample from all the groups.


小巷里的女流氓 2025-02-13 15:29:34

I'm not sure if this is doing exactly what you want, but hopefully helps toward your goal:

First, I join to the original data a version of A in which Ni is always at least 1. Then we can sample sum(A$Ni) rows, weighting by the modified Ni. Since this is a weighted random sample, the number of observations per group does not match Ni exactly but is proportional to it. This means we will likely miss some var values with a low Ni...

dat %>% 
  left_join(A %>% mutate(Ni = pmax(Ni, 1))) %>% # make Ni be at least 1
  slice_sample(n = sum(A$Ni), weight_by = Ni) %>%
  add_count(var, Ni, name = "Ni_sampled")
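
If exactly Ni rows per group are required, with a minimum of one, one possible approach is to sample inside each group rather than drawing a single weighted sample. The following is a minimal sketch under stated assumptions, not the answer author's method: the names `sizes`, `alloc`, `sampled`, and `check` are introduced here for illustration, the data are rebuilt from the group sizes in the question so the block runs on its own, and `set.seed(1)` is an arbitrary choice for reproducibility.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Group sizes copied from the question (44 groups, 10525 rows in total)
sizes <- c(a = 440, b = 255, c = 333, d = 47, e = 159, f = 67, g = 133,
           h = 342, i = 131, j = 606, k = 129, l = 126, m = 155, n = 62,
           o = 616, p = 173, q = 430, r = 2, s = 453, t = 154, v = 145,
           u = 307, w = 233, x = 315, y = 65, z = 159, aa = 758, ab = 307,
           ac = 413, ad = 184, ae = 334, af = 111, ag = 175, ah = 262,
           ai = 309, aj = 71, ak = 35, al = 302, am = 266, an = 36,
           ao = 47, ap = 415, aq = 204, ar = 259)

dat <- tibble(var   = rep(names(sizes), times = sizes),
              value = rnorm(sum(sizes)))

# Allocation table: same Ni rule as in the question, floored at 1
alloc <- dat %>%
  count(var, name = "N") %>%
  mutate(w  = N / sum(N),
         Ni = pmax(round(N * w), 1))

# Within each group, keep a random subset of exactly Ni row positions
sampled <- dat %>%
  left_join(alloc, by = "var") %>%
  group_by(var) %>%
  filter(row_number() %in% sample(n(), first(Ni))) %>%
  ungroup()

# Compare the realised group counts with the targets
check <- sampled %>%
  count(var, name = "n_sampled") %>%
  left_join(alloc, by = "var")
```

Because the draw happens separately inside every group, each of the 44 groups appears in `sampled`, and `check$n_sampled` equals `check$Ni` row by row. The total sample size is then `sum(alloc$Ni)`, which is larger than 354 because groups whose Ni rounded to 0 now contribute one row each.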
