Selecting random lines from a file
In a Bash script, I want to pick out N random lines from an input file and output them to another file.
How can this be done?
9 Answers
Use shuf with the -n option, as shown below, to get N random lines:
Sort the file randomly and pick the first 100 lines:
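The commands themselves aren't preserved in this copy; a sketch of the approach under the same assumptions (an $input_file variable naming the file, 100 lines wanted):

    lines=100
    input_file=data.txt

    # shuffle the whole input and keep the first $lines lines
    <$input_file sort -R | head -n "$lines"

    # the $'...\t...' form lets sed match literal tab characters, e.g. to drop
    # lines that are empty or contain only spaces and tabs before sampling
    <$input_file sed $'/^[ \t]*$/d' | sort -R | head -n "$lines"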
Of course <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.
Well, according to a comment on the shuf answer, he shuffled 78 000 000 000 lines in under a minute.
Challenge accepted...
EDIT: I beat my own record
powershuf did it in 0.047 seconds
The reason it is so fast is that I don't read the whole file; I just move the file pointer 10 times and print the line after each pointer position.
Gitlab Repo
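(The actual powershuf code is in the repo linked above and isn't reproduced here.) A rough shell sketch of the seek-based idea it describes; the file name, the count of 10, and the GNU stat flag are assumptions:

    file=huge.txt
    size=$(stat -c %s "$file")            # file size in bytes (GNU stat)
    for _ in $(seq 10); do
        offset=$(shuf -i 1-"$size" -n 1)  # jump to a random byte position
        # the first line after the offset is usually partial, so print the second
        tail -c +"$offset" "$file" | head -n 2 | tail -n 1
    done

Offset-based picking like this favours longer lines, so treat it as an illustration of the pointer-moving idea rather than a uniform sampler.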
Old attempt
First I needed a file of 78.000.000.000 lines:
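The generation command isn't preserved in this copy; one plain (if slow) way to produce a file with that many newline-terminated lines, as an assumption:

    # write 78 billion numbered lines; this takes a long time and a lot of disk
    seq 1 78000000000 > lines.txt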
This gives me a file with 78 billion newlines ;-)
Now for the shuf part:
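The exact invocation isn't preserved either; presumably something along these lines, with the file name from the sketch above:

    # draw 10 random lines from the huge file and time it
    time shuf -n 10 lines.txt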
The bottleneck was the CPU and the lack of multithreading: it pinned one core at 100% while the other 15 were not used.
Python is what I regularly use so that's what I'll use to make this faster:
This got me just under a minute:
I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.
I know it can get faster but I'll leave some room to give others a try.
Line counter source: Luther Blissett
My preferred option is very fast: I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.
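The command itself isn't preserved in this copy. Judging by the description, a fixed-probability awk pass would fit; this sketch assumes a 5% keep rate and placeholder file names:

    # keep roughly 5% of lines, always keep the header row, skip blank lines
    awk 'BEGIN {srand()}
         !/^$/ {if (rand() <= 0.05 || FNR == 1) print > "data-sample.txt"}' data.txt

Note that this keeps an approximate, not exact, number of lines.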
Here is an "incremental random sampler" that picks exactly N samples from any number of lines in one pass, never storing more than N lines in memory.
It works as follows:
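The original step list and source aren't preserved in this copy. The behaviour described (one pass, exactly N lines kept, never more than N in memory, equal probability per line) matches classic reservoir sampling, sketched here in awk with placeholder names:

    # reservoir sampling: keep the first n lines, then for each later line
    # replace a randomly chosen kept line with probability n/NR
    awk -v n=10000 '
        BEGIN { srand() }
        NR <= n { sample[NR] = $0; next }   # fill the reservoir first
        {
            i = int(rand() * NR) + 1        # uniform pick in 1..NR
            if (i <= n) sample[i] = $0      # replace an entry with prob n/NR
        }
        END { for (i = 1; i <= n; i++) print sample[i] }
    ' input.log > sample.txt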
I did a proof to make sure that this gives each line an equal probability of being included in the final sample. I also did some large empirical experiments to demonstrate the same thing.
I came up with this algorithm when I needed to get a random sample of 10,000 messages from a log with unknown millions of entries over a variable amount of time. Using this approach, I didn't need to store more than N messages at once, nor guess in advance what fraction of the messages to keep in order to end up with the desired N samples.
Here is a python implementation. You can call it from the command line, or call it from your own python program. It is simple enough that I have found it trivial to port to other languages.
Just for completeness's sake and because it's available from Arch's community repos: there's also a tool called shuffle, but it doesn't have any command line switches to limit the number of lines and warns in its man page: "Since shuffle reads the input into memory, it may fail on very large files."
In the below, 'c' is the number of lines to select from the input. Modify as needed:
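The snippet this refers to was lost in this copy; as a stand-in only (not the original answer's code), the same effect can be had with shuf, with c and the file names as placeholders:

    c=100
    shuf -n "$c" input.txt > output.txt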