Randomly select 3000 lines from a file using awk

Posted 2024-12-06 03:09:37

I want to randomly select 3000 lines from Sample.file, which contains 8000 lines. I would like to do this with awk or from the command line. How can I do that?

源来凯始玺欢你 2024-12-13 03:09:37

If you have GNU sort, it's easy:

sort -R FILE | head -n 3000

If you have GNU shuf, it's even easier:

shuf -n 3000 FILE

才能让你更想念 2024-12-13 03:09:37

awk 'BEGIN{srand()}
{a[NR]=$0}
# Partial Fisher-Yates shuffle, so that no line is printed twice
END{for(i=1; i<=3000 && i<=NR; i++){x=i+int(rand()*(NR-i+1)); t=a[i]; a[i]=a[x]; a[x]=t; print a[i]}}' yourFile
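
For files too large to hold in an array, a single-pass alternative is reservoir sampling (Algorithm R). This is a different technique from the answer above: it keeps only k lines in memory and still returns each line at most once. A minimal sketch, in which the function name reservoir_sample and the variable k are illustrative:

```shell
#!/bin/sh
# reservoir_sample K [FILE...]: print K lines chosen uniformly at random, without repeats.
# Hypothetical helper name; output follows reservoir order, not file order.
reservoir_sample() {
  k=$1; shift
  awk -v k="$k" '
    BEGIN { srand() }
    NR <= k { r[NR] = $0; next }          # fill the reservoir with the first k lines
    {
      j = int(rand() * NR) + 1            # uniform integer in 1..NR
      if (j <= k) r[j] = $0               # keep this line with probability k/NR
    }
    END {
      n = (NR < k) ? NR : k               # short input: print everything that was read
      for (i = 1; i <= n; i++) print r[i]
    }
  ' "$@"
}

# Example: count the lines sampled from 8000 generated lines
seq 1 8000 | reservoir_sample 3000 | wc -l
```

With no FILE argument the function reads standard input, so it can sit at the end of a pipeline.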

戏蝶舞 2024-12-13 03:09:37

Fixed as per Glenn's comment:

awk 'BEGIN {
  a=8000; l=3000   # a: total lines in the file, l: lines to select
  srand(); nr[x]   # x is unset, so this creates one dummy "" key and makes nr an array
  # The dummy key counts toward length(nr), hence <= rather than <
  while (length(nr) <= l)
    nr[int(rand() * a) + 1]   # referencing a key creates it; repeated numbers are absorbed
  }
NR in nr           # print only the lines whose number was drawn
  ' infile

P.S. Passing an array to the length built-in function is not portable; you've been warned :)
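
The portability concern can be avoided by counting new keys as they are added instead of calling length on the array. A minimal sketch under the same assumptions (the total line count is known in advance); the function name pick_lines and its parameters are illustrative, not part of the answer:

```shell
#!/bin/sh
# pick_lines TOTAL WANTED [FILE...]: print WANTED distinct random lines, in file order.
# TOTAL must be the number of input lines. Hypothetical helper name.
pick_lines() {
  total=$1; wanted=$2; shift 2
  awk -v a="$total" -v l="$wanted" '
    BEGIN {
      srand()
      if (l > a) l = a                    # cannot pick more lines than exist
      while (n < l) {
        i = int(rand() * a) + 1           # candidate line number in 1..a
        if (!(i in nr)) { nr[i]; n++ }    # count a number only the first time it is drawn
      }
    }
    NR in nr                              # print the selected lines
  ' "$@"
}

# Example: count the lines selected from 8000 generated lines
seq 1 8000 | pick_lines 8000 3000 | wc -l
```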

尘世孤行 2024-12-13 03:09:37

You can use a combination of awk, sort, head/tail and sed to do this, for example:

pax$ seq 1 100 | awk '
...$    BEGIN {srand()}
...$          {print rand() " " $0}
...$ ' | sort | head -5 | sed 's/[^ ]* //'
57
25
80
51
72

which, as you can see, selects five random lines from the one hundred generated by seq 1 100.

The awk step prefixes every line in the file with a random number and a space, in the format "0.237788 ", and sort then orders the lines by that random number.

Then you use head (or tail, if you don't have head) to get the first (or last) N lines.

Finally, sed strips the random number and the space from the start of each line.

For your specific case, you could use something like:

awk 'BEGIN {srand()} {print rand() " " $0}' file8000.txt |
    sort |
    tail -n 3000 |
    sed 's/[^ ]* //' >file3000.txt
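
The decorate, sort and strip steps above can be wrapped in a small function. This is a sketch with two changes that are assumptions rather than part of the answer: the random key is printed with a fixed "%.17f" format instead of awk's default six significant digits, which makes tied keys far less likely, and sort -n compares the keys numerically. The name random_lines is illustrative.

```shell
#!/bin/sh
# random_lines N [FILE...]: print N randomly chosen lines, in random order.
# Hypothetical helper name.
random_lines() {
  n=$1; shift
  awk 'BEGIN { srand() }
       { printf "%.17f %s\n", rand(), $0 }' "$@" |   # decorate: random key, space, line
    sort -n |                                        # order by the random key
    head -n "$n" |                                   # keep the first N
    sed 's/[^ ]* //'                                 # strip the key and the space
}

# Example: five random lines out of one hundred
seq 1 100 | random_lines 5
```

For the case in the question this would be called as random_lines 3000 file8000.txt >file3000.txt.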

洛阳烟雨空心柳 2024-12-13 03:09:37

I used these commands and got what I wanted:

awk 'BEGIN {srand()} {print rand() " " $0}' examples/data_text.txt | sort -n | tail -n 80 | awk '{printf "%1d %s %s\n",$2, $3, $4}' > examples/crossval.txt

This randomly selects 80 lines from the input file. The final awk drops the random key and reprints the first three fields of each original line ($2 to $4 of the decorated line).

李白 2024-12-13 03:09:37

In PowerShell:

Get-Content myfile | Get-Random -Count 3000

or shorter:

gc myfile | random -c 3000

动次打次papapa 2024-12-13 03:09:37

In case you only need approximately 3000 lines, this is an easy method:

awk -v N="$(wc -l < FILE)" 'BEGIN {srand()} rand() < 3000/N' FILE

The $(...) command substitution gives the number of lines in the file. srand() seeds the generator from the clock; without it, many awk implementations (gawk, for example) make the same selection on every run.
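
Each line is kept independently with probability 3000/N, so the size of the output varies from run to run; for N = 8000 the standard deviation is about 43 lines (sqrt(8000 × 0.375 × 0.625) ≈ 43.3). A short sketch to observe this, using generated input in place of FILE. The explicit seed variable and the name approx_sample are additions for illustration: srand() without an argument seeds from the clock in whole seconds, so back-to-back runs within one second would otherwise make identical choices.

```shell
#!/bin/sh
# approx_sample WANTED TOTAL SEED: keep each input line with probability WANTED/TOTAL.
# Hypothetical helper; SEED is passed explicitly so that consecutive runs differ.
approx_sample() {
  awk -v want="$1" -v N="$2" -v seed="$3" '
    BEGIN { srand(seed) }
    rand() < want / N
  '
}

# Five runs over 8000 generated lines; each count should land near 3000.
for seed in 1 2 3 4 5; do
  seq 1 8000 | approx_sample 3000 8000 "$seed" | wc -l
done
```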

山有枢 2024-12-13 03:09:37

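For large files that I don't want to shuffle, this works well and is quite fast:

sed -u -n 'l1p;l2p; ... ;l1000p;l1000q'

The -u option reduces buffering, and l1, l2, ... l1000 are random, sorted line numbers obtained from R (Python or Perl would work just as well).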
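
The line numbers do not have to come from R. A sketch of the same idea that draws them with GNU shuf and turns them into a sed script with awk; the function name sed_sample and the temporary script file are illustrative, shuf -i and mktemp are assumed to be available, and the -u option is left out because the input is a regular file:

```shell
#!/bin/sh
# sed_sample TOTAL WANTED FILE: print WANTED random lines of FILE, in file order.
# Assumes GNU shuf; TOTAL is the number of lines in FILE. Hypothetical helper name.
sed_sample() {
  script=$(mktemp) || return 1
  shuf -i "1-$1" -n "$2" |                # WANTED distinct line numbers from 1..TOTAL
    sort -n |                             # ascending, so the largest number comes last
    awk '{ print $0 "p"; last = $0 }      # "Np": print line N
         END { if (NR) print last "q" }   # "Nq": stop reading after the last pick
        ' > "$script"
  sed -n -f "$script" "$3"
  rm -f "$script"
}

# Example: sample 3000 of 8000 generated lines and count them
tmp=$(mktemp) && seq 1 8000 > "$tmp"
sed_sample 8000 3000 "$tmp" | wc -l
rm -f "$tmp"
```

Writing the commands to a script file rather than the command line avoids any argument-length limit when thousands of line numbers are involved.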
