Sampling random rows from a data frame

Posted 2024-12-17 23:21:10

I am struggling to find the appropriate function to return a specified number of rows, picked at random without replacement, from a data frame in R. Can anyone help me out?

白龙吟 2024-12-24 23:21:10

First make some data:

> df = data.frame(matrix(rnorm(20), nrow=10))
> df
           X1         X2
1   0.7091409 -1.4061361
2  -1.1334614 -0.1973846
3   2.3343391 -0.4385071
4  -0.9040278 -0.6593677
5   0.4180331 -1.2592415
6   0.7572246 -0.5463655
7  -0.8996483  0.4231117
8  -1.0356774 -0.1640883
9  -0.3983045  0.7157506
10 -0.9060305  2.3234110

Then select some rows at random:

> df[sample(nrow(df), 3), ]
           X1         X2
9  -0.3983045  0.7157506
2  -1.1334614 -0.1973846
10 -0.9060305  2.3234110
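
Because the rows are drawn at random, the output above will differ from run to run. A minimal sketch, assuming you want a repeatable draw, is to call set.seed() before sampling (the seed values here are arbitrary and chosen only for illustration):

```r
# Build some example data; the seed only makes the data itself repeatable.
set.seed(42)
df <- data.frame(matrix(rnorm(20), nrow = 10))

# Resetting the seed before each draw yields the same rows both times.
set.seed(1)
first_draw <- df[sample(nrow(df), 3), ]

set.seed(1)
second_draw <- df[sample(nrow(df), 3), ]

# sample() draws without replacement by default, so no row repeats.
identical(first_draw, second_draw)
```
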
风为裳 2024-12-24 23:21:10

John Colby's answer is the right one. However, if you are a dplyr user, there is also sample_n:

sample_n(df, 10)

This randomly samples 10 rows from the data frame. It calls sample.int, so it is really the same answer with less typing (and it is easier to use in a magrittr pipeline, since the data frame is the first argument).

眼趣 2024-12-24 23:21:10

With the data.table package you can use the idiom DT[sample(.N, M)], which samples M random rows from the data table DT (.N is the number of rows in DT).

library(data.table)
set.seed(10)

mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
不乱于心 2024-12-24 23:21:10

Write one! Wrapping JC's answer gives me:

randomRows = function(df,n){
   return(df[sample(nrow(df),n),])
}

Now make it better by checking first if n<=nrow(df) and stopping with an error.
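
One possible way to add that check is sketched below. The name random_rows and the wording of the error message are illustrative choices, not part of the original answer:

```r
# Hypothetical helper: validate n, then sample rows without replacement.
random_rows <- function(df, n) {
  stopifnot(is.data.frame(df), is.numeric(n), length(n) == 1, n >= 0)
  if (n > nrow(df)) {
    stop("n (", n, ") exceeds the number of rows (", nrow(df), ")")
  }
  # drop = FALSE keeps a data frame even when df has a single column.
  df[sample(nrow(df), n), , drop = FALSE]
}

random_rows(mtcars, 5)
```

Calling random_rows(mtcars, 100) stops with an error, because mtcars has only 32 rows.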

記憶穿過時間隧道 2024-12-24 23:21:10

Just for completeness' sake:

dplyr also lets you draw a proportion (fraction) of the rows with

df %>% sample_frac(0.33)

This is very convenient, e.g. in machine learning, when you need a fixed split ratio such as 80%:20%.
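
As a rough sketch of such an 80%:20% split, the following uses base R indexing rather than dplyr, so that the rows left out of the sample are easy to recover. The built-in mtcars data set and the seed are stand-ins chosen for illustration:

```r
set.seed(123)  # arbitrary seed for a repeatable split
n <- nrow(mtcars)

# Draw 80% of the row indices without replacement.
train_idx <- sample(n, size = floor(0.8 * n))

train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]  # the rows that were not drawn

c(train = nrow(train), test = nrow(test))
```
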

谈情不如逗狗 2024-12-24 23:21:10

As @matt_b indicates, sample_n() & sample_frac() have been soft deprecated in favour of slice_sample(). See the dplyr docs.

Example from docstring:

# slice_sample() allows you to random select with or without replacement
mtcars %>% slice_sample(n = 5)
mtcars %>% slice_sample(n = 5, replace = TRUE)
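
slice_sample() also takes a prop argument, which covers the sample_frac() use case. A short sketch, assuming dplyr 1.0.0 or later is installed:

```r
library(dplyr)

# prop = 0.25 plays the role of sample_frac(0.25):
# it keeps a random quarter of the rows, without replacement.
quarter <- mtcars %>% slice_sample(prop = 0.25)
nrow(quarter)
```
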

送舟行 2024-12-24 23:21:10

Outdated answer. Please use dplyr::slice_sample() instead.

In my R package there is a function sample.rows just for this purpose:

install.packages('kimisc')

library(kimisc)
example(sample.rows)

smpl..> set.seed(42)

smpl..> sample.rows(data.frame(a=c(1,2,3), b=c(4,5,6),
                               row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

Enhancing sample by making it a generic S3 function was a bad idea, according to comments by Joris Meys to a previous answer.

三五鸿雁 2024-12-24 23:21:10

EDIT: This answer is now outdated, see the updated version.

In my R package I have enhanced sample so that it now behaves as expected also for data frames:

library(devtools); install_github('kimisc', 'krlmlr')

library(kimisc)
example(sample.data.frame)

smpl..> set.seed(42)

smpl..> sample(data.frame(a=c(1,2,3), b=c(4,5,6),
                           row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

This is achieved by making sample an S3 generic method and providing the necessary (trivial) functionality in a function. A call to setMethod fixes everything. The original implementation still can be accessed through base::sample.

鹤仙姿 2024-12-24 23:21:10

You could do this:

library(dplyr)

cols <- paste0("a", 1:10)
tab <- matrix(1:1000, nrow = 100, dimnames = list(NULL, cols)) %>% as_tibble()
tab
# A tibble: 100 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     1   101   201   301   401   501   601   701   801   901
 2     2   102   202   302   402   502   602   702   802   902
 3     3   103   203   303   403   503   603   703   803   903
 4     4   104   204   304   404   504   604   704   804   904
 5     5   105   205   305   405   505   605   705   805   905
 6     6   106   206   306   406   506   606   706   806   906
 7     7   107   207   307   407   507   607   707   807   907
 8     8   108   208   308   408   508   608   708   808   908
 9     9   109   209   309   409   509   609   709   809   909
10    10   110   210   310   410   510   610   710   810   910
# ... with 90 more rows

Above I made a tibble with 10 columns and 100 rows.

Now you can sample it with sample_n. Because 800 exceeds the 100 available rows, replace = TRUE is required here:

sample_n(tab, size = 800, replace = TRUE)
# A tibble: 800 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1    53   153   253   353   453   553   653   753   853   953
 2    14   114   214   314   414   514   614   714   814   914
 3    10   110   210   310   410   510   610   710   810   910
 4    70   170   270   370   470   570   670   770   870   970
 5    36   136   236   336   436   536   636   736   836   936
 6    77   177   277   377   477   577   677   777   877   977
 7    13   113   213   313   413   513   613   713   813   913
 8    58   158   258   358   458   558   658   758   858   958
 9    29   129   229   329   429   529   629   729   829   929
10     3   103   203   303   403   503   603   703   803   903
# ... with 790 more rows
墨落画卷 2024-12-24 23:21:10

The 2021 way of doing this in the tidyverse is:

library(tidyverse)

df = data.frame(
  A = letters[1:10],
  B = 1:10
)

df
#>    A  B
#> 1  a  1
#> 2  b  2
#> 3  c  3
#> 4  d  4
#> 5  e  5
#> 6  f  6
#> 7  g  7
#> 8  h  8
#> 9  i  9
#> 10 j 10

df %>% sample_n(5)
#>   A  B
#> 1 e  5
#> 2 g  7
#> 3 h  8
#> 4 b  2
#> 5 j 10

df %>% sample_frac(0.5)
#>   A  B
#> 1 i  9
#> 2 g  7
#> 3 j 10
#> 4 c  3
#> 5 b  2

Created on 2021-10-05 by the reprex package (v2.0.0.9000)

往日情怀 2024-12-24 23:21:10

Select a random sample from a tibble in R:

library("tibble")
a <- your_tibble[sample(seq_len(nrow(your_tibble)), 150), ]

nrow takes a tibble and returns its number of rows. The first argument passed to sample is the sequence of row indices, from 1 to the last row of your tibble. The second argument, 150, is how many rows you want to draw. The square brackets then select the rows at the sampled indices, and the variable a holds the resulting random sample.

落墨 2024-12-24 23:21:10

You could do this:

sample_data = data[sample(nrow(data), sample_size, replace = FALSE), ]
眼藏柔 2024-12-24 23:21:10

I'm new to R, but I have been using this simple method, which works for me:

sample_of_diamonds <- diamonds[sample(nrow(diamonds),100),]

PS: Feel free to point out any drawbacks I may not have considered.
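
One drawback worth knowing about, sketched below with a made-up one-column data frame: on a plain data.frame with a single column, this indexing pattern simplifies the result to a vector unless drop = FALSE is given. Tibbles such as diamonds do not simplify this way, so the line above is not affected. The call also fails if you ask for more rows than the data has.

```r
one_col <- data.frame(x = 1:10)  # illustrative single-column data frame

# With a single column, `[` simplifies the result to a plain vector...
as_vector <- one_col[sample(nrow(one_col), 3), ]
is.data.frame(as_vector)

# ...unless drop = FALSE is given, which keeps the data frame structure.
as_frame <- one_col[sample(nrow(one_col), 3), , drop = FALSE]
is.data.frame(as_frame)
```
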
