How to get a random line from a gzip-compressed file in Python without reading it into memory

Posted on 2025-01-06 12:12:59

Let's say I have a 531 gig gzipped textfile with exactly 512 548 457 601 475 lines split by '\n' and wanted to get a random line out of it without filesplitting. (Don't worry, it's not really THAT large; just wanted to state that it's a huge file and I know how many lines it has.)

How I would normally do it with a smaller compressed file:

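import fileinput
import gzip
import random

lines = []  # avoid shadowing the built-in list

# file is the path to the gzip-compressed text file
for line in fileinput.input(file, openhook=gzip.open):
    lines.append(line)

listLength = len(lines)
# randrange excludes the upper bound, so the index is always valid
randomListLineOne = lines[random.randrange(listLength)]
randomListLineTwo = lines[random.randrange(listLength)]
...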

What I've found on the topic:

How do I read a random line from one file in python?

import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

Waterman's "Reservoir Algorithm" translated by Alex Martelli from Knuth's "The Art of Computer Programming"

Could you adapt this for compressed files? I tried setting my compressed file as afile but that didn't work.
Or is there another (easier) way to achieve this?

Comments (3)

月下伊人醉 2025-01-13 12:12:59

Monte Carlo

As an alternative to reading the file line by line*

(*use the method by David Robinson to read the gzip file as a standard file):

If all the lines are roughly the same size you could jump to a random position in the file, backtrack character by character until you get to a newline and read the full line from that point. If the lines are exactly the same size this method is exact.

If, however, the lines are not the same size but you know the distribution of line lengths x, you can use the method above and accept a line of length x only with probability P(x), rejecting the overabundant longer lines more often, such that the probability of grabbing any given line in the file is constant.

Example:

To make this simple, let's say you have a 5-line file with line lengths X={2,3,5,5,5} (20 characters in total, ignoring newlines). Picking a random point in the file, you have a 10% (2/(2+3+5+5+5)) chance of landing in the length-2 line x1, a 15% chance of landing in the length-3 line x2, and a 25% chance of landing in each of the length-5 lines (75% for the three together, x3). What you want is 20% for every line, that is 20%/20%/60% respectively. The respective weights are the ratios of the wanted to the actual probability, W=(2, 4/3, 4/5); these are the numbers such that x1*w1 = 20%, x2*w2 = 20%, x3*w3 = 60%. The normalizing factor is the largest of these weights, Z = w1 = 2. From here we know the acceptance probability for each of the lines:

 P(w1) = w1/Z = 1
 P(w2) = w2/Z = 2/3
 P(w3) = w3/Z = 2/5

Note that 10%*P(w1) = 15%*P(w2) = 25%*P(w3) = 10%, so every line is returned with the same probability per attempt, as it should be.

For your algorithm choose a random point in the file. If the associated line has length 3, pick a random number q in [0,1]. If q > 2/3, reject that spot and try again. Otherwise stop and return that line.
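
The procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the helper names line_at and random_line_by_rejection and the parameters total_size (uncompressed size in bytes) and min_len (length of the shortest line) are made up for this example. The sketch counts each line's trailing newline as part of the line and assumes the file ends with a newline, so a line of length L is accepted with probability (min_len + 1)/(L + 1). Seeking in a gzip file object is emulated by decompressing, and a backward seek starts again from the beginning, so every jump costs time proportional to the offset.

```python
import gzip
import random

def line_at(f, pos):
    """Return the full line (without its newline) containing byte offset pos."""
    # Backtrack one byte at a time until the byte before pos is a newline.
    while pos > 0:
        f.seek(pos - 1)
        if f.read(1) == b"\n":
            break
        pos -= 1
    f.seek(pos)
    return f.readline().rstrip(b"\n")

def random_line_by_rejection(f, total_size, min_len, rng=random):
    """Pick one line uniformly at random from a binary file object f.

    total_size and min_len are assumed to be known in advance.
    """
    while True:
        line = line_at(f, rng.randrange(total_size))
        # A line of length L is hit with probability proportional to L + 1
        # (its characters plus its newline), so accept it with probability
        # proportional to 1 / (L + 1).
        if rng.random() < (min_len + 1) / (len(line) + 1):
            return line
```

With a file opened as gzip.open(path, "rb"), a call such as random_line_by_rejection(f, 25, 2) would apply to the five-line test file used further down, whose uncompressed size is 25 bytes.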

When do you know X(w)?

I'll admit that knowing the exact distribution of line lengths may seem restrictive; however, there are many procedurally generated files (log files, hardware data readouts, etc.) where the distribution is known exactly. In addition, if the distribution is known only approximately, we can use the method above to determine the sample rejection criteria as a best guess and go from there.

Monte Carlo?

This may not be the best method (who can compete with Knuth?), but it may offer some insight into solving the problem in a completely different way. For those unfamiliar, the method above is a form of rejection sampling, a Monte Carlo method.

How to seek in a gzip file?

Per OP's request, here is a primer on seeking through a Python file object.

import gzip
import random

# Helper function to create some test data (as bytes, for binary mode)
def line(char, n):
    return (char * n + "\n").encode("ascii")

# Create the test data as in the example
filename = "test.gz"
with gzip.open(filename, "wb") as fout:
    fout.write(line("a", 2))
    fout.write(line("b", 3))
    fout.write(line("c", 5))
    fout.write(line("d", 5))
    fout.write(line("e", 5))

# Since we know the distribution, we know the length
length = 2 + 3 + 3 * 5 + 5  # 5 newlines

# Print 7 random points in the file
with gzip.open(filename, "rb") as fin:
    for n in range(7):
        fin.seek(random.randrange(length), 0)
        pos = fin.tell()  # position before the read
        print("Position %3i, char: %s" % (pos, [fin.read(1).decode("ascii")]))

A sample run produces the following output:

Position   8, char: ['c']
Position  23, char: ['e']
Position  15, char: ['d']
Position  10, char: ['c']
Position   4, char: ['b']
Position  16, char: ['d']
Position   2, char: ['\n']
森林迷了鹿 2025-01-13 12:12:59

You can simply use the "read a random line from one file in Python" approach, but open the file as a gzip file rather than a regular file using the gzip package.

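import gzip
import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        # Replace the current pick with probability 1 / (num + 2)
        if random.randrange(num + 2):
            continue
        line = aline
    return line

# "rt" opens the gzip file in text mode, so lines come back as str
with gzip.open("myfile.gz", "rt") as afile:
    print(random_line(afile))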
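
Since the question states that the number of lines is known, a variation is possible when several random lines are wanted: draw the line numbers first, then stream through the file once and stop after the last wanted line. The following is a minimal sketch under that assumption; the function name random_lines and its parameters n_lines and k are hypothetical and not part of the original answer. The lines come back in file order, not in random order.

```python
import gzip
import random

def random_lines(path, n_lines, k, rng=random):
    """Return k distinct random lines from a gzip text file.

    n_lines is assumed to be the exact number of lines in the file.
    """
    # Draw k distinct line numbers without building a list of all of them.
    wanted = set(rng.sample(range(n_lines), k))
    picked = []
    with gzip.open(path, "rt") as f:
        for num, line in enumerate(f):
            if num in wanted:
                picked.append(line)
                if len(picked) == k:
                    break  # no need to decompress the rest of the file
    return picked
```

This still decompresses the file sequentially up to the last chosen line, but it never holds more than k lines in memory.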
木有鱼丸 2025-01-13 12:12:59

Forgive the (very) late answer, but if you know the uncompressed size of the file from gunzip -l, you can use the seek() method to position yourself in the file.
Then throw away the next read, as it will probably be a partial line, and use the subsequent read as your random line.

Print 10 random lines from a gzipped text file.

import gzip
import random
import subprocess

# "gunzip -lq" prints: compressed size, uncompressed size, ratio, name
out = subprocess.check_output(["gunzip", "-lq", "some.txt.gz"])
unc_size = int(out.split()[1])

with gzip.open("some.txt.gz", "rb") as f:
    for x in range(10):
        f.seek(random.randrange(unc_size))
        dump = next(f, None)  # discard the (probably partial) line
        # next(f, b"") yields b"" if the seek landed in the last line
        print("Random line from byte pos", f.tell(), next(f, b""))
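
As an alternative to shelling out, the uncompressed size can be read from the file itself: the gzip format stores it, modulo 2**32, in the last four bytes of the file (the ISIZE field of RFC 1952). The sketch below assumes a single-member gzip file with no trailing data; the function name gzip_uncompressed_size is made up for illustration. Because the stored value wraps around at 4 GiB, it cannot be used directly for a file as large as the one in the question.

```python
import struct

def gzip_uncompressed_size(path):
    """Return the ISIZE trailer field: the uncompressed size modulo 2**32."""
    with open(path, "rb") as raw:
        raw.seek(-4, 2)  # the last four bytes of the compressed file
        (size,) = struct.unpack("<I", raw.read(4))  # little-endian uint32
    return size
```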