How to get a random line from a gzip-compressed file in Python without reading it into memory
Let's say I have a 531 GB gzipped text file with exactly 512 548 457 601 475 lines separated by '\n', and I want to get a random line out of it without splitting the file. (Don't worry, it's not really THAT large; I just wanted to state that it's a huge file and that I know how many lines it has.)
How I would normally do it with a smaller compressed file:
import fileinput
import gzip
import random

lines = []  # renamed from `list`, which shadows the built-in
for line in fileinput.input(file, openhook=gzip.open):
    lines.append(line)
listLength = len(lines)
# randrange excludes the upper bound, so every index is valid
randomListLineOne = lines[random.randrange(listLength)]
randomListLineTwo = lines[random.randrange(listLength)]
...
What I've found on the topic:
How do I read a random line from one file in python?
import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2):
            continue
        line = aline
    return line
Waterman's "Reservoir Algorithm" translated by Alex Martelli from Knuth's "The Art of Computer Programming"
Could you adapt this for compressed files? I tried setting my compressed file as afile but that didn't work.
Or is there another (easier) way to achieve this?
Comments (3)
Monte Carlo
As an alternative to reading the file line by line (using David Robinson's method to read the gzip file as a standard file):
If all the lines are roughly the same size, you could jump to a random position in the file, backtrack character by character until you reach a newline, and read the full line from that point. If the lines are exactly the same size, this method is exact.
If, however, the lines are not the same size but you know the distribution of line lengths x, you can use the method above and reject the overabundant x's with a probability P(x) chosen so that every line in the file is equally likely to be returned.
Example:
To make this simple, let's say you have a 5-line file with lengths X = {2, 3, 5, 5, 5}. Picking a random point in the file, you have a 10% (2/(2+3+5+5+5)) chance of landing in x1, a 15% chance of landing in x2, and a 75% chance of landing in one of the three lines of length 5 (x3). What you want is a 20%/20%/60% probability respectively. The respective weights are W = (2, 4/3, 4/5); these are the numbers such that 10% * w1 = 20%, 15% * w2 = 20%, and 75% * w3 = 60%. Each weight is proportional to 1/length, so dividing by the largest weight turns them into acceptance probabilities of (1, 2/3, 2/5) for lengths 2, 3 and 5.
The normalizing factor is the overall chance that a random point is accepted, Z = 10% * 1 + 15% * 2/3 + 75% * 2/5 = 50%. From here we know the probability for each of the lines: every line is returned with probability 10% / Z = 1/5. Note that P(x1) + P(x2) + 3*P(x3) = 1, as it should.
For your algorithm, choose a random point in the file. If the associated line has length 3, pick a random number q in [0, 1]. If q > 2/3, reject that spot and try again. If it is less, stop and return that line. A line of length 2 is always accepted, and a line of length 5 is accepted only when q <= 2/5.
When do you know X(w)?
I'll admit that knowing the exact distribution of the line lengths may seem restrictive; however, there are many procedurally generated files (log files, hardware data readouts, etc.) where the distribution is known exactly. In addition, if the distribution is known only approximately, we can use the method above to determine the sample rejection criteria as a best guess and go from there.
Monte Carlo?
This may not be the best method (who can compete with Knuth?), but it may offer some insight into solving the problem in a completely different way. For those unfamiliar, the method above is a form of rejection sampling, a Monte Carlo method.
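The rejection scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's own code: the helper names `line_at` and `random_line_by_rejection` are made up for this example, the uncompressed size and the shortest line length (newline included) are assumed to be known in advance, and a line of length x is accepted with probability min_len / x, which is one choice of rejection probabilities that makes every line equally likely.

```python
import gzip
import random


def line_at(handle, pos):
    """Return the full line (as bytes) that contains byte offset pos."""
    start = pos
    # Walk backwards until the previous byte is a newline or we reach offset 0.
    while start > 0:
        handle.seek(start - 1)
        if handle.read(1) == b"\n":
            break
        start -= 1
    handle.seek(start)
    return handle.readline()


def random_line_by_rejection(handle, size, min_len):
    """Pick a uniformly random line from a seekable binary file object.

    size    -- uncompressed length in bytes (assumed known)
    min_len -- length of the shortest line, newline included (assumed known)
    """
    while True:
        line = line_at(handle, random.randrange(size))
        # A random byte lands in a line with probability proportional to the
        # line's length, so accepting with min_len / len(line) cancels that bias.
        if random.random() < min_len / len(line):
            return line
```

With a gzip file the handle would come from `gzip.open(path, "rb")`. Offsets then refer to the uncompressed data, and every backward seek may restart decompression from the beginning, so this keeps memory use small but can be slow on a very large file.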
How to seek in a gzip file?
Per OP's request, here is a primer on seeking through a Python file object.
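As a stand-in for that primer, here is a minimal sketch of seeking inside a gzip file with the standard `gzip` module. The sample data and the temporary file name are invented for the illustration, and this is not the answerer's original listing.

```python
import gzip
import os
import tempfile

# Build a small gzipped file to experiment with (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(path, "wb") as out:
    out.write(b"first\nsecond\nthird\n")

with gzip.open(path, "rb") as handle:
    handle.seek(6)              # offsets count uncompressed bytes
    print(handle.tell())        # 6
    print(handle.readline())    # b'second\n'
    handle.seek(0)              # seeking backwards may restart decompression
    print(handle.read(5))       # b'first'
```

Seeking is emulated by decompressing and discarding data up to the requested offset, so it avoids holding the file in memory but does not avoid the decompression work.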
You can simply use the "read a random line from one file in Python" approach, but open the file as a gzip file rather than a regular file using the gzip package.
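A minimal sketch of that suggestion, assuming the file is UTF-8 text: the reservoir function from the question is kept as is apart from variable names, and the wrapper `random_line_from_gzip` is a name invented here. The function needs an open file object rather than a file name, which may be why passing the compressed file directly did not work.

```python
import gzip
import random


def random_line(lines):
    """Return one uniformly chosen item from an iterator (reservoir sampling)."""
    chosen = next(lines)
    for num, candidate in enumerate(lines):
        # The candidate is item number num + 2; keep it with probability 1 / (num + 2).
        if random.randrange(num + 2):
            continue
        chosen = candidate
    return chosen


def random_line_from_gzip(path, encoding="utf-8"):
    # Mode "rt" makes gzip.open yield decoded text lines, decompressing as it goes,
    # so only the current line and the chosen line are held in memory.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return random_line(handle)
```

This still decompresses the whole file once per call, but memory use stays constant regardless of the file size.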
Forgive the (very) late answer, but you can use the seek() method to position in the file if you know the size of the file from gunzip -l.
Then throw away the next read, as it will probably be a partial line, and use the subsequent read as your random data.
Print 10 random lines from a gzipped text file.
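A rough sketch of that idea, with two assumptions of mine: the function name `random_lines_by_seek` is invented, and the uncompressed size is obtained by seeking to the end of the stream instead of calling `gunzip -l`. The size field stored in a gzip file is kept modulo 2^32, so a tool that reads only that field can misreport files of 4 GiB or more.

```python
import gzip
import os
import random


def random_lines_by_seek(path, count=10):
    """Yield `count` lines found by seeking to random uncompressed offsets.

    Assumes a non-empty file. Not perfectly uniform: a line is more likely
    to be picked when the line before it is long.
    """
    with gzip.open(path, "rb") as handle:
        # Seeking to the end decompresses the stream once to learn its size.
        size = handle.seek(0, os.SEEK_END)
        for _ in range(count):
            handle.seek(random.randrange(size))
            handle.readline()        # discard the (probably partial) line
            line = handle.readline()
            if not line:             # ran off the end: wrap around to the first line
                handle.seek(0)
                line = handle.readline()
            yield line.decode("utf-8", errors="replace")
```

To print ten lines, loop over `random_lines_by_seek("data.txt.gz")` (a placeholder file name) and call `print(line, end="")` on each result.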