Organizing files in a tar.bz2 file with Python

Posted on 2024-09-14 12:27:22

I have about 200,000 text files stored in a single bz2 file. When I scan the bz2 file to extract the data I need, it is extremely slow: it has to read through the entire bz2 file to find the one file I am looking for. Is there any way to speed this up?

I also thought about organizing the files inside the tar.bz2 so that the reader knows where to look. Is there any way to organize the files that are put into a bz2?

More info/edit:
I need to query the compressed file for each text file. Is there a better compression method that supports such a large number of files and compresses just as thoroughly?

Comments (2)

夜司空 2024-09-21 12:27:22

Do you have to use bzip2? Its documentation makes it quite clear that it is not designed to support random access. Perhaps you should use a compression format that more closely matches your requirements. The good old Zip format supports random access, though it may of course compress worse.
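A minimal sketch of the Zip approach using Python's standard `zipfile` module. The helper names `build_zip` and `read_member` and the sample file names are made up for illustration. A Zip archive keeps a central directory, so reading one member by name does not require decompressing the others.

```python
import io
import zipfile


def build_zip(files: dict[str, str]) -> bytes:
    """Pack {name: text} into an in-memory Zip archive (hypothetical helper)."""
    buf = io.BytesIO()
    # Each member is compressed on its own, which is what allows random access.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_member(archive: bytes, name: str) -> str:
    """Read a single member by name without touching the other members."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # ZipFile looks the name up in the central directory and seeks to it.
        return zf.read(name).decode("utf-8")
```

For example, `read_member(build_zip({"a.txt": "alpha", "b.txt": "beta"}), "b.txt")` fetches only `b.txt`. With a real archive on disk you would pass a path to `zipfile.ZipFile` instead of a `BytesIO` object. Python's `zipfile` also accepts `compression=zipfile.ZIP_BZIP2` if you want bzip2 compression per member, though not every Zip tool can read such archives.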

黑凤梨 2024-09-21 12:27:22

Bzip2 compresses in large blocks (900 kB by default, I believe). One method that would dramatically speed up scanning the tar file, at the cost of a worse compression ratio, is to compress each file individually and then tar the results together. This is essentially what Zip-format files are (though they use zlib compression rather than bzip2). You could then easily read the tar's file listing and decompress only the specific file(s) you are looking for.

I don't think most tar programs offer much ability to organize files in any meaningful way, though you could write a program to do this for your special case (I know Python has tar-writing libraries, though I've only used them once or twice). However, as long as the whole tar is compressed as a single bz2 stream, you'd still have to decompress most of the data before you found what you were looking for.
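The "compress each file, then tar the results" idea can be sketched with the standard `bz2` and `tarfile` modules. The helper names `pack` and `fetch` and the `.bz2` member suffix are assumptions for illustration. Note that tar has no central index: `tarfile` still walks the member headers to find a name, but in an uncompressed tar it can skip over the other members' data instead of decompressing it.

```python
import bz2
import io
import tarfile


def pack(files: dict[str, str]) -> bytes:
    """Compress each text with bz2, then store the results in a plain tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:  # "w" = no outer compression
        for name, text in files.items():
            payload = bz2.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name + ".bz2")
            info.size = len(payload)  # size must be set before addfile()
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def fetch(archive: bytes, name: str) -> str:
    """Locate one member by name and decompress only that member."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        member = tar.extractfile(name + ".bz2")  # raises KeyError if missing
        return bz2.decompress(member.read()).decode("utf-8")
```

With about 200,000 members, scanning the headers on every lookup may still be noticeable. One option is to call `tar.getmembers()` once and keep the resulting list for later lookups. The trade-off is the one described above: compressing small files one at a time usually gives a worse ratio than one bz2 stream over the whole tar.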
