Stratified sampling - not enough observations
What I would like to achieve is to get a 10% sample from each group (a group being a combination of two factors, the recency category and the frequency category). So far I have looked at the package sampling and its function strata(), which looks promising, but I am getting the following error, and I find it hard to understand what the message means, what is wrong, or how to get around it.
Here is my code:
> d[1:10,]
        date id_email_op recency frequecy r_cat f_cat
1  29.8.2011       19393     294        1     A     G
2  29.8.2011       19394     230        4     A     D
3  29.8.2011       19395     238       12     A     B
4  29.8.2011       19396     294        1     A     G
5  29.8.2011       19397     223        9     A     C
6  29.8.2011       19398     185        7     A     C
7  29.8.2011       19399     273        2     A     F
8  29.8.2011       19400      16        4     C     D
9  29.8.2011       19401     294        1     A     G
10 29.8.2011       19402       3        5     F     C
> table(d$f_cat,d$r_cat)
         A      B      C      D      E      F
  A    176    203    289    228    335    983
  B   1044    966   1072    633    742   1398
  C   6623   3606   3020   1339   1534   2509
  D   4316   1790   1239    529    586    880
  E   8431   2798   2005    767    817   1151
  F  22140   5432   3937   1415   1361   1868
  G 100373  18316  11872   3760   3453   4778
> as.vector(table(d$f_cat,d$r_cat))
 [1]    176   1044   6623   4316   8431  22140 100373    203    966   3606   1790   2798   5432
[14]  18316    289   1072   3020   1239   2005   3937  11872    228    633   1339    529    767
[27]   1415   3760    335    742   1534    586    817   1361   3453    983   1398   2509    880
[40]   1151   1868   4778
> s <- strata(d,c("f_cat","r_cat"),size=as.vector(ceiling(0.1 * table(d$f_cat,d$r_cat))), method="srswor")
Error in strata(d, c("f_cat", "r_cat"), size = as.vector(table(d$f_cat, :
not enough obervations for the stratum 6
I can't really see which stratum "stratum 6" is. What condition does the function check in the background? I am not sure that I have the size parameter set up correctly. And yes, I have checked the documentation of the sampling package :)
Thanks, everyone.
Comments (2)
You could always do it yourself:
And then...
d[stratified,]
would be your stratified sample.
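
A minimal base R sketch of one way to build such a `stratified` index vector, not necessarily the approach this answer had in mind. The toy data frame, the seed, and the helper name `groups` are assumptions made for illustration; only the column names `r_cat` and `f_cat` and the 10% target come from the question.

```r
# Toy data standing in for the question's `d` (assumption: only the two
# strata columns matter for the sampling step).
set.seed(42)
d <- data.frame(
  id    = 1:1000,
  r_cat = sample(LETTERS[1:6], 1000, replace = TRUE),
  f_cat = sample(LETTERS[1:7], 1000, replace = TRUE)
)

# Split the row indices by stratum; drop = TRUE skips empty combinations.
groups <- split(seq_len(nrow(d)), list(d$f_cat, d$r_cat), drop = TRUE)

# Draw ceiling(10%) of each stratum's rows without replacement.
# Indexing through sample.int() avoids the sample(x, n) pitfall when a
# stratum holds a single row.
stratified <- unlist(lapply(groups, function(idx) {
  n <- ceiling(0.1 * length(idx))
  idx[sample.int(length(idx), n)]
}), use.names = FALSE)

s <- d[stratified, ]
```

Because each stratum is sampled on its own, there is no positional `size` vector to keep in sync with the order of the strata.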
Problem SOLVED!
This sentence "the sampling frame is stratified by region within state" helped me!
If you use more than one variable for stratification, you must pay attention to the order of these variables when you assign different sizes to the argument "size=".
The more strata a variable has, the higher its priority; therefore, the variable with the most strata should come first when you use "table()".
I have 10 groups in GENDER and 2 groups in age.group, so this WON'T work
But this works
I highly recommend that you change the order of your variables accordingly throughout the whole script, not only in the calls to table() and ceiling().
It solved my problem, hopefully it will solve yours too.
:)
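
The ordering issue can be made visible with a small base R sketch. It assumes that strata() matches the entries of `size` to strata in the order in which they first appear in the data (the package documentation describes `size` in roughly those terms, but this is worth verifying against the installed version). The names `key`, `first_seen`, `counts`, and `sizes` are hypothetical helpers; the ten values are the first ten rows shown in the question.

```r
# First ten rows of the question's data, strata columns only.
f_cat <- c("G", "D", "B", "G", "C", "C", "F", "D", "G", "C")
r_cat <- c("A", "A", "A", "A", "A", "A", "A", "C", "A", "F")

# table() sorts levels alphabetically, and as.vector() flattens the result
# column by column, so f_cat varies fastest within each r_cat level.
tab  <- table(f_cat, r_cat)
flat <- as.vector(tab)

# Order in which each (f_cat, r_cat) combination first appears in the data.
key        <- paste(f_cat, r_cat, sep = ".")
first_seen <- unique(key)

# Counts and 10% sizes keyed by stratum label, in first-appearance order.
counts <- table(key)[first_seen]
sizes  <- ceiling(0.1 * as.vector(counts))
names(sizes) <- first_seen

# The stratum that "stratum 6" would refer to under the assumption above.
first_seen[6]

# Hedged usage with the full data frame (requires the sampling package):
# s <- sampling::strata(d, c("f_cat", "r_cat"),
#                       size = as.vector(sizes), method = "srswor")
```

Under that assumption, the sixth stratum in the question's data is f_cat D with r_cat C, which has 1239 rows in the posted table, while the sixth entry of as.vector(table(d$f_cat, d$r_cat)) is 22140, giving a requested size of 2214. That mismatch would be consistent with the "stratum 6" error. Sorting `d` by the strata columns before sampling makes the two orders easier to compare.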