Select random lines from a file

In a Bash script, I want to pick out N random lines from an input file and output them to another file.

How can this be done?

Comments (9)

葬花如无物 2025-01-11 08:05:35

Use shuf with the -n option as shown below, to get N random lines:

shuf -n N input > output
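
For example, picking 100 lines (the file names here are just placeholders), and — as a GNU-coreutils extra not mentioned in the answer — making the sample repeatable by giving shuf a fixed byte stream as its randomness source:

shuf -n 100 input.txt > output.txt

# repeatable sample (GNU shuf only): seed the generator from a fixed byte stream
shuf -n 100 --random-source=<(yes my-seed) input.txt > output.txt
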
伪心 2025-01-11 08:05:35

Sort the file randomly and pick first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<$input_file sort -R | head -n $lines

# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq        | sort -R | head -n $lines

# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines

Of course <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.

阿楠 2025-01-11 08:05:35

Well, according to a comment on the shuf answer, he shuffled 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast is that I don't read the whole file: I just move the file pointer 10 times and print the line after each pointer position.
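
A rough sketch of that idea in plain shell (this is an assumption about the approach, not powershuf's actual code; GNU stat/tail/sed/shuf are assumed, and jumping to random byte offsets weights the pick towards longer lines):

file=lines_78000000000.txt
size=$(stat -c %s "$file")
for _ in $(seq 10); do
    # pick a random byte offset, seek there, skip the (likely partial) line
    # we landed in, print the next full line, then stop reading
    # (prints nothing if the offset lands inside the last line)
    offset=$(shuf -i 0-$((size - 1)) -n 1)
    tail -c +$((offset + 1)) "$file" | sed -n '2{p;q}'
done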

Gitlab Repo

Old attempt

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU, and shuf doesn't use multiple threads: it pinned 1 core at 100% while the other 15 sat unused.

Python is what I regularly use so that's what I'll use to make this faster:

#!/bin/python3
import random
f = open("lines_78000000000.txt", "rt")
count = 0
while 1:
  buffer = f.read(65536)
  if not buffer: break
  count += buffer.count('\n')

for i in range(10):
  f.readline(random.randint(1, count))

This got me just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett

梦中楼上月下 2025-01-11 08:05:35

My preferred option is very fast: I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.

# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total
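
The same command with the sampling fraction passed in as an awk variable (a sketch; data.txt and data-sample.txt are the placeholder names from above):

awk -v frac=0.05 'BEGIN {srand()}
     !/^$/ { if (rand() <= frac || FNR==1) print > "data-sample.txt" }' data.txt

Note that srand() with no argument seeds from the time of day, so two runs started within the same second will pick the same lines.
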
迎风吟唱 2025-01-11 08:05:35
seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
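
The one-liner above prints a single random line. A variant in the same style (a sketch, not part of the original answer) uses random.sample to pull N distinct lines, here 10:

seq 1 100 | python3 -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)), end="")'
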
始终不够爱げ你 2025-01-11 08:05:35

Here is an "incremental random sampler" that picks exactly N samples from any number of lines in one pass, never storing more than N lines in memory.

  • It selects exactly N samples.
  • Each line has an equal probability of being chosen.
  • It reads through the input just once.
  • No sorting, iterating or comparing needed.
  • It never stores more than N lines in memory at a time.

It works as follows:

  1. Store the first N lines in samples[0..N-1]
  2. After N lines, pick a random number r from 0 to (#linesSoFar - 1)
  3. If r < N, replace samples[r] with the new line. Otherwise, skip it.
  4. After reading all the lines, shuffle samples[] just in case any of the first N lines happen to still be there in their original non-random order.

I did a proof to make sure that this gives each line an equal probability of being included in the final sample. I also did some large empirical experiments to demonstrate the same thing.
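
One way the argument can go (a sketch, not necessarily the author's proof): consider a stream of T lines and look at the line that arrives as number t, with t > N. On arrival it is kept with probability N/t (r is uniform over the t values 0..t-1 and must be < N). At each later step j = t+1, ..., T the arriving line evicts it with probability (N/j) * (1/N) = 1/j, so it survives that step with probability (j-1)/j. Multiplying gives P(in final sample) = (N/t) * (t/(t+1)) * ((t+1)/(t+2)) * ... * ((T-1)/T) = N/T. One of the first N lines is kept with probability 1 on arrival and survives steps N+1..T with probability (N/(N+1)) * ... * ((T-1)/T) = N/T as well, so every line ends up in the sample with probability exactly N/T.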

I came up with this algorithm when I needed to get a random sample of 10,000 messages from a log with unknown millions of entries over a variable amount of time. Using this approach, I didn't need to store more than N messages at once, nor guess in advance what fraction of the messages to keep in order to end up with the desired N samples.

Here is a python implementation. You can call it from the command line via:

`cat <inputFile> | python incremental_sampler.py [#samples=1000] > <outputFile>`

or call it from your own python program. It is simple enough that I have found it trivial to port to other languages.

import random
import sys


# Incremental random sampler.
#   Author: Randy Wilson, Ph.D.
#   Date: 14 February 2007
# Generates a random sample of all the values passed to it,
#   without ever having to store more than the number of samples
#   being taken ('max_samples').
# Stores the first 'max_samples' values. 
# Then chooses a random index from 0..num_samples-1.
#   If the random index is within the first max_samples, 
#     then that array element is replaced.
#     Otherwise, the new value is ignored.
# This gives every incoming sample the same probability of
#   max_samples/num_samples of being included.
#
# For example, if max_samples=1000, and 50,000 values are 
#   passed to add_sample(), then the first 1000 samples are all kept.
# After that, there is a 1000/1001, 1000/1002, etc., chance
#   of each sample replacing one selected earlier.
# After adding all 50,000 samples, get_samples() will shuffle the 
#   1000 samples that were kept and return them.
# Never were more than 1000 samples stored during the entire process.
# This means you could get a sample from billions of values 
#   without blowing out memory.
class IncrementalSampler:
    def __init__(self, sample_size):
        # Number of samples desired in the end
        self.sample_size = sample_size
        # Number of samples added via add_sample so far
        self.num_samples = 0
        # Values included in the random sample so far. These may be replaced by values added later.
        self.samples = []

    def add_sample(self, value):
        if self.num_samples < self.sample_size:
            self.samples.append(value)
        else:
            position = random.randint(0, self.num_samples)
            if position < self.sample_size:
                self.samples[position] = value
        self.num_samples += 1

    def get_samples(self):
        random.shuffle(self.samples)
        return self.samples


# Command-line interface.
# Usage: cat <file> | incremental_sampler.py [#samples] > <outputFile>
sample_size = 1000
if len(sys.argv) > 1:
    sample_size = int(sys.argv[1])
sampler = IncrementalSampler(sample_size)
for line in sys.stdin:
    sampler.add_sample(line.rstrip('\r\n'))
for line in sampler.get_samples():
    print(line)
狂之美人 2025-01-11 08:05:35

Just for completeness's sake and because it's available from Arch's community repos: there's also a tool called shuffle, but it doesn't have any command line switches to limit the number of lines and warns in its man page: "Since shuffle reads the input into memory, it may fail on very large files."

刘备忘录 2025-01-11 08:05:35
# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Number of total lines minus desired number of lines, minus 1 (one "0" is echoed before the loop below)

    N_d_m_1=$(( $2 - 1)) # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d 
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"
网名女生简单气质 2025-01-11 08:05:35

In the below 'c' is the number of lines to select from the input. Modify as needed:

#!/bin/sh

gawk '
# each incoming line is kept with probability min(1, c/NR) and written into
# a rolling buffer of c slots; the END block prints whatever the buffer holds
BEGIN   { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines)  print lines[i] }

' "$@"