How can I tail a compressed file without reading its entire contents?

Posted 2024-07-30 02:15:14

I want to emulate the functionality of gzcat | tail -n.

This would be helpful for huge files (a few GB or so). Can I tail the last few lines of such a file without reading it from the beginning? I suspect this isn't possible, since I'd guess that with gzip the encoding depends on all of the preceding text.

But I'd still like to hear whether anyone has tried something similar, or perhaps knows of a compression algorithm that could provide such a feature.
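
For reference, this is the pipeline being emulated (the file name is illustrative); with a plain .gz file, everything before the tail still has to be decompressed:

$ gzcat big.log.gz | tail -n 20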

Comments (7)

空气里的味道 2024-08-06 02:15:14

No, you can't. gzip's compression algorithm works on a stream and adapts its internal coding to what the stream contains in order to achieve its high compression ratio.

Without knowing what the contents of the stream are before a certain point, it's impossible to know how to decompress from that point on.

Any algorithm that lets you decompress arbitrary parts of the data has to periodically reset its coding state while compressing, which costs some compression ratio.
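
A quick way to see this (file name illustrative; the exact error text varies by gzip version): bytes taken from the middle of a gzip stream are not a valid stream on their own, so gunzip rejects them:

$ tail -c 1048576 big.log.gz | gunzip > /dev/null
gzip: stdin: not in gzip format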

不美如何 2024-08-06 02:15:14

BGZF is used to create the indexed, gzip-compressed BAM files produced by Samtools. These are randomly accessible.

http://samtools.sourceforge.net/
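
As a sketch of what this looks like with htslib's bgzip tool (which writes BGZF; file name and offsets are illustrative, and the -i/-b/-s options are those of a recent bgzip): compress once with an index, then pull an arbitrary slice by uncompressed offset:

$ bgzip -i big.log                          # writes big.log.gz plus a big.log.gz.gzi index
$ bgzip -b 123456789 -s 65536 big.log.gz    # emit 64 kB starting at uncompressed offset 123456789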

大姐,你呐 2024-08-06 02:15:14

If you have control over what goes into the file in the first place, and it's anything like a ZIP file, you could store chunks of a predetermined size under filenames in increasing numerical order, then just decompress the last chunk/file.
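
A minimal sketch of that idea with standard tools (names and sizes illustrative; -d asks GNU split for numeric suffixes): split the data into fixed-size pieces, archive them, and decompress only the last member to tail it:

$ split -d -b 64m big.log chunk_            # chunk_00, chunk_01, ... in order
$ zip big.zip chunk_*
$ unzip -p big.zip "$(unzip -Z1 big.zip | sort | tail -1)" | tail -n 20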

秋叶绚丽 2024-08-06 02:15:14

If it's an option, then bzip2 might be a better compression algorithm to use for this purpose.

Bzip2 uses a block compression scheme. So if you take a piece from the end of your file that you are sure is large enough to contain the entire last block, you can recover that block with bzip2recover.

The block size is selected when the file is written: the compression options -1 (or --fast) through -9 (or --best) correspond to block sizes of 100k through 900k, with 900k being the default.

The bzip2 command-line tools don't give you a friendly way to do this in a pipeline, but given that bzip2 is not stream-oriented, perhaps that's not surprising.
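
A rough sketch of that recovery trick (sizes illustrative; bzip2recover names its outputs rec00001<file>.bz2, rec00002<file>.bz2, ...): take a piece of the end that must contain the final 900k block, let bzip2recover find the block boundaries, and decompress the last recovered block:

$ tail -c 2097152 big.log.bz2 > piece.bz2
$ bzip2recover piece.bz2                    # writes rec00001piece.bz2, rec00002piece.bz2, ...
$ bzcat "$(ls rec*piece.bz2 | tail -1)" | tail -n 20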

娇柔作态 2024-08-06 02:15:14

zindex creates and queries an index on a compressed, line-based text file in a time- and space-efficient way.

https://github.com/mattgodbolt/zindex
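
A usage sketch, with flags as given in the project README (verify against zindex --help; the file name and regex are illustrative): build an index keyed on a captured field, then fetch matching lines without a full decompress:

$ zindex big.log.gz --regex 'id:([0-9]+)'   # index each line by the captured numeric id
$ zq big.log.gz 1023                        # print lines whose indexed key is 1023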

不乱于心 2024-08-06 02:15:14

Well, you can do that if you first create an index for each file...

I've developed a command-line tool that creates indexes for gzip files, allowing very quick random access inside them, and it does this interleaved with actions (extract, tail, continuous tail, etc.):
https://github.com/circulosmeos/gztool

But you can just do a tail (-t) and the index will be created automatically: if you do the same again in the future it will be much quicker, and in any case the first time takes about the same time as gunzip | tail:

$ gztool -t my_file.gz
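
A couple of related invocations, assuming the -i (build index) and -b (extract from uncompressed byte) options described in gztool's README (the offset is illustrative):

$ gztool -i my_file.gz                      # build the index up front
$ gztool -b 123456789 my_file.gz            # extract from uncompressed byte 123456789 using the index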

甚是思念 2024-08-06 02:15:14

An example of a fully gzip-compatible pseudo-random access format is dictzip:

For compression, the file is divided up into "chunks" of data, each chunk less than 64kB. [...]

To perform random access on the data, the offset and length of the data are provided to library routines. These routines determine the chunk in which the desired data begins, and decompress that chunk. Consecutive chunks are decompressed as necessary.
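
A sketch using the dictzip tool from the dictd project, assuming the -s/--start and -e/--size options documented in dictzip(1) (file name and offsets illustrative; verify the flags locally):

$ dictzip big.log                           # writes big.log.dz, still a valid gzip stream
$ dictzip -d -k -s 5000000 -e 65536 big.log.dz   # decompress 64 kB at uncompressed offset 5000000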
