Random and non-random sampling in a single sample

Posted on 2025-01-30 14:42:41

Is there a way to sample X number of random rows and X non-random rows in a single sample?
For example, I want to get 1,000 samples of 4 rows of iris. I want to randomly sample 3 rows of iris and the fourth row will be the same one in each sample (this is to mimic a hybrid sampling design).

I can sample 3 random rows 1000x and the fixed row 1000x and then merge the two data frames together, but for a few reasons this is not an ideal situation. The code to do that looks something like the following:

df <- iris

fixed_sample <- iris[7, ]

random <- list()
fixed <- list()

counter <- 0
for (i in 1:1000) {
  # draw 3 random rows plus the single fixed row; repeated 1000 times
  tempsample_random <- df[sample(1:nrow(df), 3, replace = FALSE), ]
  tempsample_fixed <- fixed_sample[sample(1:nrow(fixed_sample), 1, replace = FALSE), ]

  random[[i]] <- tempsample_random
  fixed[[i]] <- tempsample_fixed

  # progress counter
  counter <- counter + 1
  print(counter)
}

# stack the per-iteration results into two data frames
random_results <- do.call(rbind, random)
fixed_results <- do.call(rbind, fixed)

From here I would make a new column as a grouping variable and then merge them together based on that group. So every four rows of the final data frame has 3 random rows and row number 7 (fixed_sample) in each sample.

I've looked into using splitstackshape::stratified, but haven't gotten it to work the way I need it to. I'll be doing this over several levels of sampling effort (sample 2, 3, 4, 5 rows, etc. 1,000x each) so it would be ideal to be able to pull the fixed and random rows in the same sample from the beginning.

Any help would be greatly appreciated.

Comments (3)

萌梦深 2025-02-06 14:42:41

I think you can do this in a single line using lapply. In this case we will draw 3 samples, but you can change seq(3) to seq(1000) to get your 1000 samples. I have followed your example and selected row 7 as the fixed row.

lapply(seq(3), function(i) iris[c(sample(seq(nrow(iris))[-7], 3), 7),])
#> [[1]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 67           5.6         3.0          4.5         1.5 versicolor
#> 105          6.5         3.0          5.8         2.2  virginica
#> 111          6.5         3.2          5.1         2.0  virginica
#> 7            4.6         3.4          1.4         0.3     setosa
#> 
#> [[2]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 147          6.3         2.5          5.0         1.9 virginica
#> 131          7.4         2.8          6.1         1.9 virginica
#> 126          7.2         3.2          6.0         1.8 virginica
#> 7            4.6         3.4          1.4         0.3    setosa
#> 
#> [[3]]
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 143          5.8         2.7          5.1         1.9  virginica
#> 145          6.7         3.3          5.7         2.5  virginica
#> 60           5.2         2.7          3.9         1.4 versicolor
#> 7            4.6         3.4          1.4         0.3     setosa

Created on 2022-05-18 by the reprex package (v2.0.1)
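
Since the question mentions adding a grouping variable and merging afterwards, here is a minimal sketch of one way to tag each sample while it is drawn and then stack the list into a single data frame. The names `n_samples`, `fixed_row`, `samples`, `stacked`, and `sample_id` are illustrative assumptions, not part of the answer above.

```r
# Sketch: draw the samples as a list, then stack them with a sample_id column.
set.seed(1)        # only so this illustration is reproducible
n_samples <- 5     # hypothetical; use 1000 for the real run
fixed_row <- 7

samples <- lapply(seq_len(n_samples), function(i) {
  pool <- seq_len(nrow(iris))[-fixed_row]   # every row except the fixed one
  rows <- c(sample(pool, 3), fixed_row)     # 3 random rows, then the fixed row
  cbind(sample_id = i, iris[rows, ])        # tag the rows with their sample number
})

stacked <- do.call(rbind, samples)
rownames(stacked) <- NULL   # drop the inherited row names
```

Every block of four consecutive rows in `stacked` then shares one `sample_id`, so no separate merge step should be needed.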

邮友 2025-02-06 14:42:41

Here's a method:

fixed_row <- 7
set.seed(42)
random <- replicate(1000, df[c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3)),], simplify = FALSE)
random[1:3]
# [[1]]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 7           4.6         3.4          1.4         0.3     setosa
# 50          5.0         3.3          1.4         0.2     setosa
# 66          6.7         3.1          4.4         1.4 versicolor
# 75          6.4         2.9          4.3         1.3 versicolor
# [[2]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 123          7.7         2.8          6.7         2.0 virginica
# 50           5.0         3.3          1.4         0.2    setosa
# [[3]]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 7            4.6         3.4          1.4         0.3    setosa
# 129          6.4         2.8          5.6         2.1 virginica
# 48           4.6         3.2          1.4         0.2    setosa
# 25           4.8         3.4          1.9         0.2    setosa

The intent is that we sample from all rows except the fixed row that you want in every sample, then prepend the fixed row to the vector of row indices. Because setdiff(.., fixed_row) works on arbitrary sets, fixed_row can hold zero or more row indices and still give the desired result.

set.seed(42)
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]  7 50 66 75
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 147 123  50
c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), size = 3))
# [1]   7 129  48  25

(Note that set.seed is used here only for reproducibility on StackOverflow; you likely should not use it in production.)
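
The question also mentions repeating this over several levels of sampling effort. A minimal sketch of how the same idea might be wrapped in a helper follows. The function `draw_samples` and its arguments `fixed_rows`, `n_random`, and `n_samples` are hypothetical names, and the sketch assumes that "sample 2, 3, 4, 5 rows" means the total rows per sample, including the fixed row.

```r
# Sketch: one helper for any sampling effort and any set of fixed rows.
draw_samples <- function(data, fixed_rows, n_random, n_samples = 1000) {
  pool <- setdiff(seq_len(nrow(data)), fixed_rows)  # rows eligible for random draws
  stopifnot(n_random <= length(pool))
  replicate(
    n_samples,
    # sample positions within `pool`, so a length-1 pool is not treated as 1:n
    data[c(fixed_rows, pool[sample.int(length(pool), n_random)]), ],
    simplify = FALSE
  )
}

# Total rows per sample of 2, 3, 4, 5: one fixed row plus 1 to 4 random rows.
efforts <- 2:5
by_effort <- lapply(efforts, function(k) {
  draw_samples(iris, fixed_rows = 7, n_random = k - 1, n_samples = 10)
})
names(by_effort) <- paste0("rows_", efforts)
```

`by_effort$rows_4` is then a list of 10 data frames with four rows each, the first of which is always row 7. Passing a longer vector as `fixed_rows` would keep several rows fixed in every sample.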

一萌ing 2025-02-06 14:42:41
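df <- iris

fixed_row <- 2
resample_count <- 1000

# Row indices for all samples: the fixed row followed by 3 random rows.
# The random rows are drawn from every row except the fixed one,
# so the fixed row cannot appear twice in the same sample.
keep_rows <- unlist(
  Map(1:resample_count,
      f = function(x) c(fixed_row, sample(setdiff(seq_len(nrow(df)), fixed_row), 3))
      )
)

# Each block of 4 consecutive rows is one sample.
resamples <- df[keep_rows, ]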
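
With a flat index vector like this, it is easy to lose track of where one sample ends and the next begins, and if the random draw were taken from all rows the fixed row could appear twice in one sample. A minimal sketch of one way to check the result, assuming the random pool excludes the fixed row and that each block of four consecutive indices is one sample; `pool`, `index_matrix`, `fixed_per_sample`, and `all_ok` are illustrative names.

```r
# Sketch: verify that the fixed row appears exactly once in every sample.
set.seed(1)   # for a reproducible illustration only
fixed_row <- 2
resample_count <- 1000

pool <- setdiff(seq_len(nrow(iris)), fixed_row)   # rows eligible for random draws
keep_rows <- unlist(lapply(seq_len(resample_count), function(i) {
  c(fixed_row, sample(pool, 3))                   # fixed row first, then 3 random rows
}))

# One column per sample, four row indices per column.
index_matrix <- matrix(keep_rows, nrow = 4)
fixed_per_sample <- colSums(index_matrix == fixed_row)
all_ok <- all(fixed_per_sample == 1)
```

If `all_ok` is `TRUE`, every sample contains the fixed row once and three other distinct rows.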