How do I randomly sample the contents of a file?
I have a file with contents
abc
def
high
lmn
...
...
There are more than 2 million lines in the file.
I want to randomly sample lines from the file and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).
Related (Possibly Duplicate) Questions:
Assuming you basically want to output about 2.5% of all lines, this would do:
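The command that originally followed is not preserved here. As a minimal sketch of the same idea, assuming a POSIX awk is available, each line can be kept independently with a fixed probability. The helper name `sample_fraction` is hypothetical.

```shell
# Hypothetical helper: keep each line of FILE with probability P.
# Usage: sample_fraction P FILE
sample_fraction() {
    # srand() seeds awk's generator from the clock; rand() returns a value in [0, 1).
    # A pattern with no action prints the line whenever the pattern is true.
    awk -v p="$1" 'BEGIN { srand() } rand() < p' "$2"
}
```

For a 2 million line file, `sample_fraction 0.025 input.txt > sample.txt` yields roughly 50K lines, but the exact count varies from run to run because every line is decided independently.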
Shell way:
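The original shell command is not preserved here. One possibility, assuming GNU coreutils is installed, is `shuf`, whose `-n` option limits the number of output lines. The wrapper name `sample_with_shuf` is hypothetical.

```shell
# Hypothetical helper: print N lines chosen at random from FILE, without repeats.
# Assumes GNU coreutils `shuf` is on the PATH.
# Usage: sample_with_shuf N FILE
sample_with_shuf() {
    shuf -n "$1" "$2"
}
```

For the question's case this would be called as `sample_with_shuf 50000 input.txt > sample.txt`.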
From perlfaq5: "How do I select a random line from a file?"
Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
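The Camel Book's Perl snippet is not preserved here. As a rough awk rendering of the same single-line reservoir idea (not the book's code), line number NR replaces the current pick with probability 1/NR, so only one line is held in memory at a time. The helper name `pick_random_line` is hypothetical.

```shell
# Hypothetical helper: print one line of FILE, each line equally likely.
# Usage: pick_random_line FILE
pick_random_line() {
    awk '
        BEGIN { srand() }                 # seed from the clock
        rand() * NR < 1 { line = $0 }     # true with probability 1/NR
        END { if (NR > 0) print line }    # print nothing for an empty file
    ' "$1"
}
```

The first line is always taken (probability 1/1), and each later line overwrites it with shrinking probability, which is what makes the final pick uniform over all lines.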
You can use the File::Random module which provides a function for that algorithm:
Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
Perl way:
Use CPAN. The File::RandomLine module does exactly what you need.
If you need to extract an exact number of lines:
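The code that originally followed is not preserved here. One way to get an exact count in a single pass is reservoir sampling with a reservoir of K lines; the sketch below assumes a POSIX awk, and the helper name `reservoir_sample` is hypothetical.

```shell
# Hypothetical helper: print exactly K lines chosen uniformly at random from FILE
# (or every line, if FILE has fewer than K lines).
# Usage: reservoir_sample K FILE
reservoir_sample() {
    awk -v k="$1" '
        BEGIN { srand() }                    # seed from the clock
        NR <= k { r[NR] = $0; next }         # fill the reservoir with the first K lines
        {
            j = int(rand() * NR) + 1         # uniform integer in 1..NR
            if (j <= k) r[j] = $0            # replace a slot with probability K/NR
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print r[i]
        }
    ' "$2"
}
```

For the question's case, `reservoir_sample 50000 input.txt > sample.txt` keeps only 50K lines in memory regardless of the file's size. The output is not in file order, and because awk's `srand()` seeds from the clock in whole seconds, two runs within the same second may return the same sample.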