Randomly subsample a dataset multiple times and calculate the mean and variance
I never came to any conclusions re: this question, so I thought I would rephrase it and ask again.
I would like to subsample my dataset 10,000 times to generate means and 95% CIs for each of my responses.
Here is an example of how the data set is structured:
x <- read.table(tc <- textConnection("
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 3 B 1.2 2.9
1 4 C 1.5 2.3
2 1 A 1.7 0.3
2 2 A 1.9 0.3
3 1 A 0.2 0.5"), header = TRUE); close(tc)
I would like to subsample each study/variable combination only once. So, for example, the subsetted dataset would look like this:
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 4 C 1.5 2.3
2 1 A 1.7 0.3
3 1 A 0.2 0.5
Notice rows 3 and 6 are gone, because both measured a variable twice (B in the first case, A in the second case).
I want to draw subsampled data sets again and again so I may derive overall means of value1 and value2 with 95% CIs for each variable. So the output I would like after the whole subsampling routine would be:
variable mean_value1 lower_value1 upper_value1 mean_value2 etc....
A 2.3 2.0 2.6 2.1
B 2.5 2.0 3.0 2.5
C 2.1 1.9 2.3 2.6
Here is some code I have to grab the subset:
library(plyr)  # provides ddply() and .()

subsample <- function(x, B) {
  x <- x[order(x$study, x$variable), ]                # keep each study/variable group contiguous
  samps <- ddply(x, .(study, variable), nrow)[, 3]    # for each study/variable combination,
                                                      # how many experiments are there
  expIdx <- which(!duplicated(x[, c("study", "variable")]))  # first row of each combination
  n <- length(samps)                                  # how many study/variable combinations are there
  lapply(1:B, function(a) {                           # one subsampled data frame per replicate
    idx <- floor(runif(n, rep(0, n), samps))          # 0-based offset of the chosen experiment in each combination
    x[idx + expIdx, ]                                 # one row per study/variable combination
  })
}
Any help is appreciated. I recognize this is complicated so please let me know if you need clarification!
Comments (2)
Split your data by Study, Experiment and Variable, then apply the bootstrap to each subset. There are many ways to do this, including:
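One possibility, sketched in base R only: `split()` the data frame on the grouping columns and run a percentile bootstrap of the mean inside each subset. `boot_mean_ci` and `by_cols` are hypothetical names introduced for this illustration. Note that in the question's sample data every study/expt/variable subset holds a single row, so the intervals collapse to a point; changing `by_cols` (for example to `"variable"` alone) gives subsets with more than one value.

```r
# Sample data copied from the question.
x <- read.table(text = "
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 3 B 1.2 2.9
1 4 C 1.5 2.3
2 1 A 1.7 0.3
2 2 A 1.9 0.3
3 1 A 0.2 0.5", header = TRUE)

# boot_mean_ci() is a hypothetical helper: percentile bootstrap of the mean.
boot_mean_ci <- function(v, B = 1000) {
  # Resample indices rather than values, so length-1 vectors behave correctly.
  means <- replicate(B, mean(v[sample.int(length(v), replace = TRUE)]))
  c(mean  = mean(v),
    lower = unname(quantile(means, 0.025)),
    upper = unname(quantile(means, 0.975)))
}

# Split on the grouping columns, then bootstrap value1 within each subset.
by_cols <- c("study", "expt", "variable")
subsets <- split(x, x[by_cols], drop = TRUE)

set.seed(42)  # only for reproducibility of the illustration
ci <- t(sapply(subsets, function(g) boot_mean_ci(g$value1, B = 500)))
print(ci)
```

The same call can be repeated with `g$value2` to cover the second response.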
Here's a solution, although fair warning, it's not going to scale terribly well and I'm unaware of the statistical validity of this kind of scheme:
Example output
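As a minimal sketch of one such subsampling scheme, written in base R and not necessarily the code this answer used: draw one random row from every study/variable combination, average `value1` and `value2` per variable, repeat `B` times, and summarise the replicate means with their 2.5% and 97.5% quantiles. `draw_once` and `subsample_summary` are hypothetical names, and the resulting interval reflects only the spread across subsamples, so its statistical interpretation remains an open assumption.

```r
# Sample data copied from the question.
x <- read.table(text = "
study expt variable value1 value2
1 1 A 1.0 1.1
1 2 B 1.1 2.1
1 3 B 1.2 2.9
1 4 C 1.5 2.3
2 1 A 1.7 0.3
2 2 A 1.9 0.3
3 1 A 0.2 0.5", header = TRUE)

# Draw one random row from every study/variable combination.
draw_once <- function(d) {
  groups <- split(seq_len(nrow(d)), d[c("study", "variable")], drop = TRUE)
  rows <- vapply(groups, function(i) i[sample.int(length(i), 1)], integer(1))
  d[rows, ]
}

# Repeat the draw B times and summarise the per-variable means.
subsample_summary <- function(d, B = 1000) {
  reps <- lapply(seq_len(B), function(b) {
    aggregate(cbind(value1, value2) ~ variable, data = draw_once(d), FUN = mean)
  })
  all_means <- do.call(rbind, reps)

  # Mean and 2.5% / 97.5% quantiles of the replicate means, per variable.
  summarise_col <- function(col) {
    t(sapply(split(all_means[[col]], all_means$variable), function(m) {
      c(mean  = mean(m),
        lower = unname(quantile(m, 0.025)),
        upper = unname(quantile(m, 0.975)))
    }))
  }

  out <- cbind(summarise_col("value1"), summarise_col("value2"))
  colnames(out) <- paste(rep(c("mean", "lower", "upper"), 2),
                         rep(c("value1", "value2"), each = 3), sep = "_")
  data.frame(variable = rownames(out), out, row.names = NULL)
}

set.seed(1)  # only for reproducibility of the illustration
result <- subsample_summary(x, B = 200)
print(result)
```

Each replicate rebuilds the groups and calls `aggregate()`, so the run time grows linearly with `B`; for 10,000 replicates on a large data set it may be worth precomputing the group indices once.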