How do I randomly sample the contents of a file?

Posted on 2024-07-25 05:23:31

I have a file with contents

abc
def
high
lmn
...
...

The file has more than 2 million lines.
I want to randomly sample lines from it and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).

Related (Possibly Duplicate) Questions:

Comments (5)

淤浪 2024-08-01 05:23:31

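Assuming you basically want to output about 2.5% of all lines (50,000 out of 2,000,000), this would do. Each line is kept independently, so the output is only approximately 50,000 lines:

# $input is a filehandle already opened for reading on the file.
# Perl allows only one statement modifier per statement, so the
# "if" goes inside an explicit while loop.
while (<$input>) {
    print if rand() < 0.025;
}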
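Since the question also asks for a handy shell command, the same per-line coin flip can be sketched with awk. This is an illustrative sketch, not part of the original answer; the function name `sample_fraction` is made up here, and an awk with `srand()` and `rand()` is assumed.

```shell
# sample_fraction FRACTION [FILE...]
# Keep each input line independently with probability FRACTION.
# (Hypothetical helper name, used only for this sketch.)
sample_fraction() {
    frac=$1
    shift
    # srand() seeds the generator; rand() returns a value in [0, 1).
    awk -v p="$frac" 'BEGIN { srand() } rand() < p' "$@"
}

# Usage: keep about 2.5% of the lines of a file.
#   sample_fraction 0.025 file > sample.txt
```

As with the Perl loop, the number of lines printed varies from run to run. In many awk implementations `srand()` seeds from the clock in whole seconds, so two runs started within the same second may return the same sample.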
失去的东西太少 2024-08-01 05:23:31

Shell way (requires a sort that supports the -R option, such as GNU sort):

sort -R file | head -n 50000
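A possible alternative, assuming GNU coreutils is installed, is `shuf`, which shuffles instead of sorting. GNU `sort -R` orders lines by a hash of the key, so identical lines end up next to each other, while `shuf` treats every line independently. The wrapper name `sample_lines` below is only for illustration.

```shell
# sample_lines COUNT FILE
# Print COUNT distinct lines of FILE in random order using GNU shuf.
# (Hypothetical wrapper name; shuf itself is part of GNU coreutils.)
sample_lines() {
    shuf -n "$1" -- "$2"
}

# Usage: draw 50,000 lines.
#   sample_lines 50000 file > sample.txt
```

If COUNT is larger than the number of lines in the file, `shuf` simply prints every line once.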
落叶缤纷 2024-08-01 05:23:31

From perlfaq5: "How do I select a random line from a file?"


Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.

Here's a reservoir-sampling algorithm from the Camel Book:

srand;
rand($.) < 1 && ($line = $_) while <>;

This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

You can use the File::Random module which provides a function for that algorithm:

use File::Random qw/random_line/;
my $line = random_line($filename);

Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
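The perlfaq5 snippet keeps a single line, but the question asks for 50K lines. The same idea extends to a fixed-size sample by keeping a reservoir of k lines, a scheme often called Algorithm R. The following is a minimal sketch in awk, not taken from perlfaq5; the function name `reservoir_sample` is made up for this example.

```shell
# reservoir_sample COUNT [FILE...]
# One-pass uniform sample of COUNT lines without replacement.
# (Hypothetical helper name, used only for this sketch.)
reservoir_sample() {
    k=$1
    shift
    awk -v k="$k" '
        BEGIN { srand() }
        # The first k lines fill the reservoir.
        NR <= k { pool[NR] = $0; next }
        # Line number NR enters the reservoir with probability k/NR,
        # replacing a uniformly chosen slot.
        {
            j = int(rand() * NR) + 1
            if (j <= k) pool[j] = $0
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print pool[i]
        }
    ' "$@"
}

# Usage: sample 50,000 lines in a single pass; this also works on a pipe.
#   reservoir_sample 50000 file > sample.txt
```

Memory use is proportional to COUNT rather than to the file size, and the input is read only once. The output follows reservoir slot order, which is neither the file order nor a uniformly random order.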

变身佩奇 2024-08-01 05:23:31

Perl way:

Use CPAN. The File::RandomLine module does exactly what you need.

开始看清了 2024-08-01 05:23:31

If you need to extract an exact number of lines:

use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

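# count lines in file (this counts newline characters, so a final
# line without a trailing newline is not counted)
my ($lines, $buffer) = (0, '');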
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++;
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;
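For comparison, the same two-pass idea (count the lines, choose distinct line numbers, then extract them in file order) can be sketched in shell. This assumes GNU `shuf` for the `-i` option; the function name `pick_exact` is made up for this example and is not part of the answer above.

```shell
# pick_exact COUNT FILE
# Print exactly COUNT distinct lines of FILE, in their original order.
# (Hypothetical helper name, used only for this sketch.)
pick_exact() {
    count=$1
    file=$2
    # First pass: count the lines.
    total=$(awk 'END { print NR }' "$file")
    # Never ask for more lines than the file contains.
    if [ "$count" -gt "$total" ]; then
        count=$total
    fi
    if [ "$count" -le 0 ]; then
        return 0
    fi
    # shuf picks COUNT distinct line numbers from 1..total; awk reads
    # them from stdin ("-") and prints the matching lines of FILE.
    shuf -i "1-$total" -n "$count" |
        awk 'NR == FNR { want[$1] = 1; next } FNR in want' - "$file"
}

# Usage:
#   pick_exact 50000 file > sample.txt
```

Because the chosen line numbers are distinct by construction, no retry loop is needed, and the second pass keeps the sampled lines in file order.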