Randomly select 3000 lines from a file using awk
I want to randomly select 3000 lines from sample.file, which contains 8000 lines.
I would like to do this with awk or from the command line. How can I do that?
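If you have GNU sort, it's easy: shuffle the file with its -R (random sort) option and keep the first 3000 lines with head.
If you have GNU shuf, it's even easier: its -n option picks the requested number of random lines directly.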
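The two approaches above can be sketched as follows. This is a minimal sketch assuming GNU coreutils; pick_sort and pick_shuf are hypothetical helper names introduced only for illustration, and sample.file is the file from the question.

```shell
#!/bin/sh
# Sketch assuming GNU coreutils: sort -R and shuf are GNU extensions.
# pick_sort and pick_shuf are hypothetical helper names.

# Shuffle the whole file, then keep the first N lines.
pick_sort() {  # usage: pick_sort FILE N
    sort -R "$1" | head -n "$2"
}

# Let shuf pick N random lines directly.
pick_shuf() {  # usage: pick_shuf FILE N
    shuf -n "$2" "$1"
}

# Example: pick_shuf sample.file 3000 > picked.txt
```

Note that sort -R orders lines by a random hash of their content, so identical lines end up next to each other. If the file contains duplicate lines, shuf is probably the better fit.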
Fixed as per Glenn's comment:
P.S. Passing an array to the length built-in function is not portable; you've been warned :)
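The awk code this answer refers to is not reproduced here. As an independent sketch (not the answer's original code), selection sampling picks exactly K lines in a single pass without storing them in an array, which also sidesteps the length(array) portability issue. The helper name pick_awk and the awk variables k and n are assumptions for illustration; the sketch also assumes the file ends with a newline, so that wc -l matches awk's line count.

```shell
#!/bin/sh
# Sketch only: pick_awk is a hypothetical helper, not the answer's code.
# Selection sampling: each line is kept with probability
# (picks still needed) / (lines still remaining).
pick_awk() {  # usage: pick_awk FILE K
    n=$(wc -l < "$1")
    awk -v k="$2" -v n="$n" '
        BEGIN { srand() }
        # n - NR + 1 lines remain (including this one); k picks remain.
        rand() * (n - NR + 1) < k { print; k-- }
    ' "$1"
}

# Example: pick_awk sample.file 3000 > picked.txt
```

The selected lines keep their original order. Since srand() with no argument seeds from the time of day, two runs started within the same second will typically produce the same selection.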
You can use a combination of awk, sort, head/tail and sed to do this, for example to select five random lines from the one hundred generated by seq 1 100.
The awk trick prefixes every line in the file with a random number and a space, in the format "0.237788 ", and sort then orders the lines by that random number.
Then you use head (or tail if you don't have head) to get the first (or last) N lines.
Finally, sed strips the random number and the space from the start of each line.
For your specific case, apply the same pipeline to sample.file with N set to 3000.
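The pipeline described above can be sketched like this. It is a reconstruction from the description rather than the answer's exact commands; pick_decorated is a hypothetical helper name that reads its input from stdin.

```shell
#!/bin/sh
# Sketch reconstructed from the prose; pick_decorated is a hypothetical name.
pick_decorated() {  # usage: pick_decorated N < FILE
    # 1. awk: prefix each line with a random number and a space.
    # 2. sort: order the lines by that random prefix.
    # 3. head: keep the first N lines.
    # 4. sed: strip the prefix (everything up to the first space).
    awk 'BEGIN { srand() } { print rand() " " $0 }' |
        sort |
        head -n "$1" |
        sed 's/^[^ ]* //'
}

# Five random lines out of the hundred produced by seq:
seq 1 100 | pick_decorated 5

# For the question's file: pick_decorated 3000 < sample.file > picked.txt
```

A plain lexical sort is enough here, because any consistent ordering of random keys yields a random order of the lines.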
I used these commands, and got what I wanted:
which in fact randomly selects 80 lines from the input file.
In PowerShell:
or shorter:
In case you only need approximately 3000 lines, this is an easy method:
The part between the backticks (`) gives the number of lines in the file.
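One way to sketch this approximate method: keep each line independently with probability K/N, where N is the line count. The helper name pick_approx and the awk variables k and n are assumptions for illustration, and $(...) is used here as the modern equivalent of backticks.

```shell
#!/bin/sh
# Sketch only: pick_approx is a hypothetical helper name.
# Each line is printed with probability K / N, so the number of
# selected lines varies around K from run to run.
pick_approx() {  # usage: pick_approx FILE K
    awk -v k="$2" -v n="$(wc -l < "$1")" \
        'BEGIN { srand() } rand() < k / n' "$1"
}

# Example: pick_approx sample.file 3000 > picked.txt
```

With 8000 lines and K set to 3000, each line is kept with probability 0.375, so the result will usually contain somewhat more or fewer than 3000 lines.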
For a huge file that I didn't want to shuffle, this worked out well and pretty fast:
sed -u -n 'l1p;l2p; ... ;l1000p;l1000q'
The -u option reduces buffering, and l1, l2, ... l1000 are random, sorted line numbers obtained from R (Python or Perl would work just as well).
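As a sketch of how such a sed script could be built from the shell instead of R, assuming GNU shuf is available: pick_sed is a hypothetical helper name, K must be at least 1, and -u is left out because it only affects buffering (it can be added with GNU sed).

```shell
#!/bin/sh
# Sketch only: pick_sed is a hypothetical helper; shuf -i is a GNU extension.
pick_sed() {  # usage: pick_sed FILE K   (K >= 1)
    n=$(( $(wc -l < "$1") ))
    # Draw K distinct line numbers, sort them, and turn them into
    # a sed script of the form "l1p;l2p;...;lKp;lKq".
    script=$(shuf -i "1-$n" -n "$2" | sort -n |
        awk '{ printf "%dp;", $1; last = $1 } END { printf "%dq", last }')
    # Print only the chosen lines and quit right after the last one.
    sed -n "$script" "$1"
}

# Example: pick_sed sample.file 3000 > picked.txt
```

The final q command stops sed as soon as the last chosen line has been printed, so the rest of a huge file is never read. For a very large K, writing the script to a file and passing it with sed -f may be needed to stay within command-line length limits.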