大数据框架中使用DPLYR软件包的比例分配抽样
我有一个具有10525行的大数据集,其中有44个不同的类别:
var = c(rep("a",440), rep("b",255) ,rep("c",333),rep("d",47) ,rep("e",159),rep("f",67) ,rep("g",133),
rep("h",342), rep("i",131) ,rep("j",606),rep("k",129),rep("l",126),rep("m",155),rep("n",62),
rep("o",616), rep("p", 173),rep("q",430),rep("r",2) ,rep("s",453),
rep("t",154), rep("v",145),rep("u", 307),rep("w",233),rep("x",315),rep("y",65),rep("z",159),
rep("aa",758),rep("ab",307),rep("ac", 413),rep("ad",184),rep("ae",334),rep("af",111),rep("ag",175),
rep("ah", 262),rep("ai",309),rep("aj",71),rep("ak",35),rep("al",302),
rep("am",266), rep("an",36),rep("ao",47),rep("ap",415),rep("aq",204),rep("ar",259))
value = rnorm(10525)
dat = tibble(var,value)
现在,我想进行比例分配采样,即我想精确地对相应类别的NI列中的表中计算出的子示例数量(组) )。
d1=dat%>%
group_by(var)%>%
summarise(N = n())
d2=dat%>%
group_by(var)%>%
summarise(w = n()/nrow(.))
A = left_join(d1,d2,by="var")%>%
+ mutate(Ni = round(N*w));A
# A tibble: 44 × 4
var N w Ni
<chr> <int> <dbl> <dbl>
1 a 440 0.0418 18
2 aa 758 0.0720 55
3 ab 307 0.0292 9
4 ac 413 0.0392 16
5 ad 184 0.0175 3
6 ae 334 0.0317 11
7 af 111 0.0105 1
8 ag 175 0.0166 3
9 ah 262 0.0249 7
10 ai 309 0.0294 9
# … with 34 more rows
理论上正确的总样本量必须是:354
sum(A$Ni)
[1] 354
2问题:
1)我如何在r中做到这一点?
2)我如何限制,如果子样本(Ni)为0进行1观察?
有帮助吗?我会很感激。
我的努力
a1 = dat %>%
left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
slice_sample(n = sum(A$Ni), weight_by = w)%>%
select(c(var,Ni))%>%
group_by(var)%>%
summarise(n());a1
Joining, by = "var"
# A tibble: 41 × 2
var `n()`
<chr> <int>
1 a 17
2 aa 21
3 ab 11
4 ac 11
5 ad 7
6 ae 10
7 af 4
8 ag 3
9 ah 11
10 ai 4
# … with 31 more rows
,但应该是所有44个组。 或者
A = left_join(d1,d2,by="var")%>%
mutate(Ni = round(N*w))%>%
mutate(x = replace(Ni,Ni==0,1));A
sum(A$x)
print(tibble(A),n=44)
a1 = dat %>%
left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
slice_sample(n = sum(A$x), weight_by = w)%>%
select(c(var,Ni))%>%
group_by(var)%>%
summarise(n())
print(tibble(a1),n=44)
再次不是从所有组中采样。
I have a big data set with 10525 rows with 44 different categories:
var = c(rep("a",440), rep("b",255) ,rep("c",333),rep("d",47) ,rep("e",159),rep("f",67) ,rep("g",133),
rep("h",342), rep("i",131) ,rep("j",606),rep("k",129),rep("l",126),rep("m",155),rep("n",62),
rep("o",616), rep("p", 173),rep("q",430),rep("r",2) ,rep("s",453),
rep("t",154), rep("v",145),rep("u", 307),rep("w",233),rep("x",315),rep("y",65),rep("z",159),
rep("aa",758),rep("ab",307),rep("ac", 413),rep("ad",184),rep("ae",334),rep("af",111),rep("ag",175),
rep("ah", 262),rep("ai",309),rep("aj",71),rep("ak",35),rep("al",302),
rep("am",266), rep("an",36),rep("ao",47),rep("ap",415),rep("aq",204),rep("ar",259))
value = rnorm(10525)
dat = tibble(var,value)
Now I want to make proportional allocation sampling i.e I want to sample exactly the number of subsamples as has been calculated in the table below in the column Ni for the corresponding category (group).
d1=dat%>%
group_by(var)%>%
summarise(N = n())
d2=dat%>%
group_by(var)%>%
summarise(w = n()/nrow(.))
A = left_join(d1,d2,by="var")%>%
+ mutate(Ni = round(N*w));A
# A tibble: 44 × 4
var N w Ni
<chr> <int> <dbl> <dbl>
1 a 440 0.0418 18
2 aa 758 0.0720 55
3 ab 307 0.0292 9
4 ac 413 0.0392 16
5 ad 184 0.0175 3
6 ae 334 0.0317 11
7 af 111 0.0105 1
8 ag 175 0.0166 3
9 ah 262 0.0249 7
10 ai 309 0.0294 9
# … with 34 more rows
The theoretically correct total sample size must be: 354
sum(A$Ni)
[1] 354
2 questions:
1) How I can do that in R ?
2) How I can put a constrain that if the subsample (Ni) is 0 to take 1 observation ?
Any help? I would appreciate it .
My effort
a1 = dat %>%
left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
slice_sample(n = sum(A$Ni), weight_by = w)%>%
select(c(var,Ni))%>%
group_by(var)%>%
summarise(n());a1
Joining, by = "var"
# A tibble: 41 × 2
var `n()`
<chr> <int>
1 a 17
2 aa 21
3 ab 11
4 ac 11
5 ad 7
6 ae 10
7 af 4
8 ag 3
9 ah 11
10 ai 4
# … with 31 more rows
but it should be all the 44 groups.
Or
A = left_join(d1,d2,by="var")%>%
mutate(Ni = round(N*w))%>%
mutate(x = replace(Ni,Ni==0,1));A
sum(A$x)
print(tibble(A),n=44)
a1 = dat %>%
left_join(A %>% mutate(w = n()/nrow(.), w = if_else(w <= 0.009, 1, w)) )%>%
slice_sample(n = sum(A$x), weight_by = w)%>%
select(c(var,Ni))%>%
group_by(var)%>%
summarise(n())
print(tibble(a1),n=44)
but again does not sample from all the groups.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我不确定这是否确切地做您想要的事情,但希望能帮助您的目标:
首先,我将Ni始终至少1个的原始数据添加到原始数据。 (a $ ni)行,通过修改的
ni
加权。由于这是一个样本,因此观察值的数量与Ni不匹配,但与之成正比。这意味着我们可能会错过一些var
值,而低ni ...I'm not sure if this is doing exactly what you want, but hopefully helps toward your goal:
First, I add to the original data a version of A where Ni is always at least 1. Then we can sample for
sum(A$Ni)
rows, weighting by the modifiedNi
. Since this is a sample, the number of observations does not match Ni but is proportional to it. This means we will likely miss somevar
values with a low Ni...