Stratified sampling - not enough observations

Posted on 2024-12-02 08:20:51

I would like to draw a 10% sample from each group, where a group is a combination of two factors: the recency category and the frequency category. So far I have looked at the strata() function in the sampling package. It looks promising, but I get the error below, and I find it hard to tell from the message what is wrong or how to work around it.

Here is my code:

> d[1:10,]
        date id_email_op recency frequecy r_cat f_cat
1  29.8.2011       19393     294        1     A     G
2  29.8.2011       19394     230        4     A     D
3  29.8.2011       19395     238       12     A     B
4  29.8.2011       19396     294        1     A     G
5  29.8.2011       19397     223        9     A     C
6  29.8.2011       19398     185        7     A     C
7  29.8.2011       19399     273        2     A     F
8  29.8.2011       19400      16        4     C     D
9  29.8.2011       19401     294        1     A     G
10 29.8.2011       19402       3        5     F     C
> table(d$f_cat,d$r_cat)

         A      B      C      D      E      F
  A    176    203    289    228    335    983
  B   1044    966   1072    633    742   1398
  C   6623   3606   3020   1339   1534   2509
  D   4316   1790   1239    529    586    880
  E   8431   2798   2005    767    817   1151
  F  22140   5432   3937   1415   1361   1868
  G 100373  18316  11872   3760   3453   4778
> as.vector(table(d$f_cat,d$r_cat))
 [1]    176   1044   6623   4316   8431  22140 100373    203    966   3606   1790   2798   5432
[14]  18316    289   1072   3020   1239   2005   3937  11872    228    633   1339    529    767
[27]   1415   3760    335    742   1534    586    817   1361   3453    983   1398   2509    880
[40]   1151   1868   4778
> s <- strata(d,c("f_cat","r_cat"),size=as.vector(ceiling(0.1 * table(d$f_cat,d$r_cat))), method="srswor")
Error in strata(d, c("f_cat", "r_cat"), size = as.vector(table(d$f_cat,  : 
  not enough obervations for the stratum 6

I can't see what stratum 6 is. What condition does the function check in the background? I am not sure that I have set up the size parameter correctly. And yes, I have checked the documentation of the sampling package :)

Thanks, everyone.

兮颜 2024-12-09 08:20:51

You could always do it yourself:

stratified <- NULL
for (r in as.character(unique(d$r_cat))) {
  for (f in as.character(unique(d$f_cat))) {
    # Row names of the current stratum (one r_cat/f_cat combination)
    rows <- rownames(d)[d$r_cat == r & d$f_cat == f]
    # Draw 10% of this stratum, without replacement
    stratified <- c(stratified, sample(rows, round(length(rows) * 0.1)))
  }
}

And then...

d[stratified,] would be your stratified sample.
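
The same per-stratum draw can also be written without an explicit loop. The following is only a sketch in base R, run on made-up toy data that reuses the column names from the question. It rounds up with ceiling(), as the question does, and the object names idx_by_stratum, picked and stratified_sample are illustrative.

```r
# Toy data standing in for `d`; column names match the question, values are invented.
set.seed(42)
d <- data.frame(
  r_cat = sample(LETTERS[1:6], 2000, replace = TRUE),
  f_cat = sample(LETTERS[1:7], 2000, replace = TRUE)
)

# Split the row indices by stratum (every r_cat/f_cat combination).
# drop = TRUE discards combinations that never occur in the data.
idx_by_stratum <- split(seq_len(nrow(d)), list(d$r_cat, d$f_cat), drop = TRUE)

# Draw ceiling(10%) of each stratum without replacement.
# Indexing with sample.int() avoids the sample(x, n) pitfall when x has length 1.
picked <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), ceiling(0.1 * length(idx)))]
}), use.names = FALSE)

stratified_sample <- d[picked, ]
```

Because each stratum's sample size is computed from that stratum's own row count, the requested size can never exceed the number of available rows.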

阳光下的泡沫是彩色的 2024-12-09 08:20:51

Problem SOLVED!

This sentence helped me: "the sampling frame is stratified by region within state".
If you use more than one variable for stratification, you must pay attention to the order of these variables when you assign different sizes to the argument "size=".
In my experience, the more strata a variable has, the higher its priority, so the variable with the most strata should come first when you call table().

I have 10 groups in GENDER and 2 groups in age.group, so this does NOT work:

nnum <- as.vector(table(d.order$GENDER, d.order$age.group))

But this works:

d.order <- d.cut[order(d.cut$age.group, d.cut$GENDER), ]
nnum <- as.vector(table(d.order$age.group, d.order$GENDER))
n <- round(.05 * nnum)
testData <- strata(d.order, stratanames = c("age.group", "GENDER"), size = n, method = "srswor")

I highly recommend using the same variable order throughout the whole script, not only in table() or ceiling().
It solved my problem, hopefully it will solve yours too.
:)
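
Another way to make the order explicit, instead of relying on how table() lays out its cells, is to build the size vector in the order in which the strata first appear in the data. As far as I can tell from the sampling documentation, that is the order strata() expects for size, but treat this as an assumption and check it against your installed version. Under that reading, the first ten rows printed in the question give the distinct f_cat/r_cat combinations G/A, D/A, B/A, C/A, F/A, D/C, so stratum 6 would be f_cat D with r_cat C (1239 rows), while the sixth element of the size vector is ceiling(0.1 * 22140) = 2214, which would explain the error. The sketch below uses invented toy data, and the names key_all, key_order and n_per_stratum are illustrative.

```r
# Toy data; column names follow the question, values are invented.
set.seed(1)
d <- data.frame(
  f_cat = sample(LETTERS[1:7], 5000, replace = TRUE),
  r_cat = sample(LETTERS[1:6], 5000, replace = TRUE)
)

# One key per row identifying its stratum (f_cat/r_cat combination).
key_all <- paste(d$f_cat, d$r_cat, sep = ".")

# Distinct strata in the order they first appear when reading d from the top.
key_order <- unique(key_all)

# Count rows per stratum; fixing the factor levels keeps the counts in that same order.
n_per_stratum <- as.vector(table(factor(key_all, levels = key_order)))

# 10% of each stratum, rounded up, aligned with the order of first appearance.
size <- ceiling(0.1 * n_per_stratum)

# With the sampling package installed, the call from the question would then be
# (assumption: size is matched to strata in order of first appearance):
# s <- sampling::strata(d, c("f_cat", "r_cat"), size = size, method = "srswor")
```

Sorting the data first, as in the answer above, changes the order of first appearance, so the size vector has to be rebuilt after any reordering.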
