Decompressing .bz2 files in Python

Posted 2024-07-30 20:53:41

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.

I'm quite a Python newbie, so the answer is probably quite obvious, please help me.

In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that. Is that right? I've tried all sorts of ways to do this, and I usually get a "ValueError: couldn't find end of stream" error on the last line of this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.

openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if(len(newLine)==0):
       break
    s+=newLine
    print s                   
    uncompressedData = bz2.decompress(s)

Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.

METHOD A:

print 'decompressing ' + filename

fileHandle = open(zipFile)
uncompressedData = ''

while True:            
    s = fileHandle.read(1024)
    if not s:
        break
        print('RAW "%s"', s)
        uncompressedData += bz2.decompress(s)

        uncompressedData += bz2.flush()

        newFile = open(steamTF2mapdir + filename.split(".bz2")[0],"w")
        newFile.write(uncompressedData)
        newFile.close()

I get the error:

uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream

METHOD B

zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)

s = fileHandle.read()
uncompressedData = bz2.decompress(s)

Same error :

uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream

Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.

By the by, used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.


黄昏下泛黄的笔记 2024-08-06 20:53:45

This was very helpful.
When opened on Windows, 44 of 2300 files gave an "end of file missing" error.
Adding the b(inary) flag to open fixed the problem.

for line in bz2.BZ2File(filename, 'rb', 10000000) :

works well. (the 10M is the buffering size that works well with the large files involved)
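A rough Python 3 sketch of the same line-by-line loop, for readers on a newer interpreter. The helper name `count_lines` and the UTF-8 default are assumptions for illustration, not part of the comment above. The buffering argument is left out here because, as far as I know, Python 3 ignored it and later versions dropped it.

```python
import bz2

def count_lines(path, encoding="utf-8"):
    """Count the lines in a bz2-compressed text file without loading it whole."""
    count = 0
    # bz2.open in "rt" mode decompresses and decodes to str as it goes.
    with bz2.open(path, "rt", encoding=encoding) as f:
        for _line in f:
            count += 1
    return count
```

Iterating this way keeps memory use roughly flat, which is what matters for the large files mentioned.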

Thanks!


许仙没带伞 2024-08-06 20:53:43

openZip = open(zipFile, "r")

If you are running on Windows, you may want to use openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations and you don't want them to be translated.

newLine = openZip.readline()

As Alex pointed out, this is very wrong, because the concept of "lines" is foreign to a compressed stream.

s = fileHandle.read(1024)
[...]
uncompressedData += bz2.decompress(s)

This is wrong for the same reason. 1024-byte chunks are unlikely to mean much to the decompressor, since it wants to work with its own block size.
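If reading in fixed-size chunks is still wanted, one option is the stateful `bz2.BZ2Decompressor` object, which accepts arbitrary slices of a stream, whereas `bz2.decompress` expects a complete stream in one call. The following is only a sketch for Python 3: the function name `decompress_in_chunks` and its parameters are hypothetical, and it assumes the file holds a single bz2 stream.

```python
import bz2

def decompress_in_chunks(src_path, dst_path, chunk_size=1024):
    """Decompress src_path into dst_path, feeding the decompressor chunk by chunk."""
    decompressor = bz2.BZ2Decompressor()
    # Both files are opened in binary mode so no newline translation happens.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # The decompressor buffers partial blocks internally between calls.
            dst.write(decompressor.decompress(chunk))
            if decompressor.eof:
                # End-of-stream marker seen; assumption: nothing useful follows.
                break
    if not decompressor.eof:
        raise ValueError("compressed data ended before the end-of-stream marker")
```

Note that the decompressor object has no flush step; output is returned from each `decompress` call as it becomes available.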

s = fileHandle.read()
uncompressedData = bz2.decompress(s)

If that doesn't work, I'd say it's the newline translation problem I mentioned above.


葬花如无物 2024-08-06 20:53:43

You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.

uncompressedData = bz2.BZ2File(zipFile).read()

seems to be closer to what you're angling for.
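A minimal Python 3 sketch of that one-liner, wrapped in a `with` block so the file handle gets closed. The helper name `read_bz2` is hypothetical and only there to make the snippet reusable.

```python
import bz2

def read_bz2(path):
    """Return the decompressed contents of a .bz2 file as bytes."""
    # BZ2File opens the underlying file in binary mode on its own,
    # so there is no text/binary flag to get wrong.
    with bz2.BZ2File(path) as f:
        return f.read()
```

The result is a `bytes` object, so the output file should be opened with "wb" when writing it back out.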

Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:

opening ... the compressed file as if
it was a textfile ... It's NOT.

open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). ((my recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more)).
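For the plain `open` route, a short Python 3 sketch of what the 'rb' fix might look like. The function name `decompress_file` is an assumption for illustration, and the whole file is read into memory, which is fine for small files but may not suit very large ones.

```python
import bz2

def decompress_file(path):
    """Read a .bz2 file as raw bytes and decompress it in one call."""
    # 'rb' turns off newline translation, so the compressed bytes arrive intact.
    with open(path, "rb") as f:
        compressed = f.read()
    # One-shot decompression needs the complete stream.
    return bz2.decompress(compressed)
```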

In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).

In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines and, at least in some cases, take a byte worth '\x1a' as meaning a logical end of file), and so the low-level reading and writing code must compensate.
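The effect of a read that stops early can be imitated on any platform by cutting the compressed bytes short by hand. This is only a sketch in Python 3: the halfway cut point and the helper name `is_complete_stream` are assumptions for illustration, not a reproduction of what Windows does to any particular file.

```python
import bz2

def is_complete_stream(data):
    """Return True if data decompresses cleanly as bz2, False otherwise."""
    try:
        bz2.decompress(data)
    except (ValueError, OSError):
        # Python 3 raises ValueError when the end-of-stream marker is missing.
        return False
    return True

original = bz2.compress(b"some map data " * 1000)
# Simulate a text-mode read that gave up partway through the file.
truncated = original[: len(original) // 2]

print(is_complete_stream(original))
print(is_complete_stream(truncated))
```

The intact stream decompresses, while the shortened one fails because the decompressor never reaches the end-of-stream marker, which matches the kind of error reported in the question.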

So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
