Compression formats with good support for random access within archives?

Published 2024-07-11 18:00:12 · 1,542 characters · 10 views · 0 comments

10 Answers

月竹挽风 2024-07-18 18:00:12

Take a look at dictzip. It is compatible with gzip and allows coarse random access.

An excerpt from its man page:

dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which
is completely compatible with the gzip file format. An extension to the gzip
file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data
to be stored in the header of a compressed file. Programs like gzip and zcat
will ignore this extra data. However, [dictzcat --start] will make use
of this data to perform pseudo-random access on the file.

I have the dictzip package in Ubuntu; alternatively, its source code is in dictd-*.tar.gz. Its license is GPL, so you are free to study it.

Update:

I improved dictzip to have no file size limit.
My implementation is under MIT license.
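
To make the Extra Field mechanism from the excerpt concrete, here is a minimal Python sketch (not dictzip itself, nor the improved implementation mentioned above) that parses the RFC 1952 header of a gzip member and lists any Extra Field subfields. The file name is illustrative, and the remark about dictzip's "RA" subfield tag is my understanding rather than something stated in the excerpt:

    import struct

    def read_gzip_extra_field(path):
        """Return the Extra Field subfields of the first gzip member, if any.

        RFC 1952: a 10-byte fixed header, then an optional Extra Field when
        the FEXTRA flag (bit 2 of FLG) is set.
        """
        with open(path, "rb") as f:
            header = f.read(10)
            if len(header) < 10 or header[:2] != b"\x1f\x8b":
                raise ValueError("not a gzip file")
            if not (header[3] & 0x04):      # FEXTRA not set
                return []
            (xlen,) = struct.unpack("<H", f.read(2))
            extra = f.read(xlen)

        # The Extra Field is a list of subfields: SI1, SI2, LEN, then LEN data bytes.
        subfields, pos = [], 0
        while pos + 4 <= len(extra):
            length = struct.unpack_from("<H", extra, pos + 2)[0]
            subfields.append((extra[pos:pos + 2], extra[pos + 4:pos + 4 + length]))
            pos += 4 + length
        return subfields

    # dictzip keeps its chunk table in one of these subfields (tagged "RA", I believe).
    for tag, payload in read_gzip_extra_field("example.dict.dz"):   # hypothetical file
        print(tag, len(payload), "bytes")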

千纸鹤带着心事 2024-07-18 18:00:12

I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.

For example, bzip2 compressed files are composed of independent compressed blocks (each under 1 MB uncompressed), which are delimited by sequences of magic bytes, so you could parse the bzip2 file, find the block boundaries and then just uncompress the right block. This would need some indexing to remember where the blocks start.

Still, I think the best solution would be to split your file into chunks of your choice, and then compress them with some archiver, like zip or rar, which supports random access to individual files in the archive.
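
For illustration, here is a minimal sketch of that chunk-and-archive approach using Python's standard zipfile module; the chunk size and member names are arbitrary choices:

    import zipfile

    CHUNK = 10 * 1024 * 1024  # 10 MB of uncompressed data per archive member

    def pack(src_path, archive_path):
        """Split src_path into fixed-size chunks and store each as a zip member."""
        with open(src_path, "rb") as src, \
             zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            index = 0
            while True:
                chunk = src.read(CHUNK)
                if not chunk:
                    break
                zf.writestr(f"chunk{index:06d}", chunk)
                index += 1

    def read_at(archive_path, offset, size):
        """Read `size` uncompressed bytes starting at `offset` of the original file."""
        data = bytearray()
        with zipfile.ZipFile(archive_path) as zf:
            names = set(zf.namelist())
            while size > 0:
                member = f"chunk{offset // CHUNK:06d}"
                if member not in names:
                    break                           # past the end of the original file
                chunk = zf.read(member)             # only this member is decompressed
                piece = chunk[offset % CHUNK:offset % CHUNK + size]
                if not piece:
                    break
                data += piece
                offset += len(piece)
                size -= len(piece)
        return bytes(data)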

一场春暖 2024-07-18 18:00:12

The .xz file format (which uses LZMA compression) seems to support this:

Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.

This should be sufficient for your purpose. A drawback is that the API of liblzma (for interacting with these containers) does not seem that well-documented, so it may take some effort figuring out how to randomly access blocks.
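
The block splitting has to happen at compression time. As a rough illustration, the sketch below just shells out to the xz command-line tool to create a block-split file and to print its block index; the 16 MiB block size is an arbitrary choice, and actually reading an individual block back requires liblzma's block API (or a library that wraps it), which Python's built-in lzma module does not expose:

    import subprocess

    def compress_with_blocks(src_path, block_size="16MiB"):
        """Create src_path.xz, split into independently compressed blocks."""
        subprocess.run(
            ["xz", "--keep", "--force", f"--block-size={block_size}", src_path],
            check=True,
        )
        return src_path + ".xz"

    def show_block_index(xz_path):
        """Print the per-block index that makes limited random access possible."""
        subprocess.run(["xz", "--list", "--verbose", xz_path], check=True)

    # compress_with_blocks("big.log"); show_block_index("big.log.xz")   # hypothetical file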

昵称有卵用 2024-07-18 18:00:12

The gzip format can be randomly accessed provided an index has been previously created, as demonstrated in zlib's zran.c source code.

I've developed a command line tool upon zlib's zran.c which creates indexes for gzip files: https://github.com/circulosmeos/gztool

It can even create an index for a still-growing gzip file (for example a log created by rsyslog directly in gzip format), thus in practice reducing index-creation time to zero. See the -S (Supervise) option.
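
For a rough idea of what such an index buys you from Python, here is a tiny sketch assuming the third-party indexed_gzip package (a zran-style index behind a file-like object); the class name and behaviour are my assumption about that package, and it is separate from gztool:

    # pip install indexed_gzip   (third-party package; interface assumed)
    import indexed_gzip

    f = indexed_gzip.IndexedGzipFile("big.log.gz")   # hypothetical file; builds seek points on demand
    f.seek(500_000_000)      # jump into the middle of the uncompressed stream
    print(f.read(80))        # read 80 bytes from that offset
    f.close()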

ヤ经典坏疍 2024-07-18 18:00:12

bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.

See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.

I don't know to what extent it is adaptable to other applications.
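
A standard-library-only sketch of why BGZF is seekable: every block is a complete gzip member whose Extra Field records the block's compressed size, so a reader can hop from header to header to build an index and later decompress any single block in isolation. This simplified parser assumes the fixed BGZF header layout from the SAM/BAM specification and omits error handling:

    import gzip
    import struct

    def bgzf_block_offsets(path):
        """Return (compressed_offset, block_size) for every BGZF block in the file."""
        blocks = []
        with open(path, "rb") as f:
            pos = 0
            while True:
                header = f.read(18)   # gzip header incl. XLEN (12 bytes) + 'BC' subfield (6 bytes)
                if len(header) < 18:
                    break
                bsize = struct.unpack_from("<H", header, 16)[0] + 1   # total block size
                blocks.append((pos, bsize))
                pos += bsize
                f.seek(pos)
        return blocks

    def read_block(path, offset, size):
        """Decompress one BGZF block independently of the rest of the file."""
        with open(path, "rb") as f:
            f.seek(offset)
            return gzip.decompress(f.read(size))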

記憶穿過時間隧道 2024-07-18 18:00:12

Because lossless compression works better in some regions than others, if you store compressed data in blocks of convenient length BLOCKSIZE, some compressed blocks will expand to a much longer piece of plaintext than others, even though each block contains exactly the same number of compressed bytes.

You might look at
"Compression: A Key for Next-Generation Text Retrieval Systems"
by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates
in Computer magazine, November 2000
http://doi.ieeecomputersociety.org/10.1109/2.881693

Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a vocabulary list) into a whole word.
One can directly search the compressed text for words or phrases,
which turns out to be even faster than searching uncompressed text.

Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.

You can give every word a unique 2 byte code, since you probably have less than 65,000 unique words in your text.
(There are almost 13,000 unique words in the KJV Bible).
Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases".
(The compression gained by packing frequent words and phrases into two bytes
is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter).
There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression.
For example, you could tweak an LZW compressor to dump "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data.
Or you could arbitrarily chop up your uncompressed data into 5 byte phrases in a lexicon file, one line per phrase.
Or you could chop up your uncompressed data into actual English words, and put each word -- including the space at the beginning of the word -- into the lexicon file.
Then use "sort --unique" to eliminate duplicate words in that lexicon file.
(Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)

Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two byte "words" -- from there to the end of the file.
Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two byte code" to "variable-length phrase".
My first draft would start with a simple one line per phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.

You can pick any random even byte offset into the compressed text, and start decompressing from there.
I don't think it's possible to make a finer-grained random access compressed file format.
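
Here is a toy Python sketch of the fixed-width word-coding idea; it is my own illustration rather than the paper's actual byte-oriented scheme. Every token (word or whitespace run) gets a 2-byte code, so any even byte offset into the compressed text is a valid place to start decoding:

    import re
    import struct

    def compress(text):
        """Encode text as 2-byte codes; return the codes plus a code-to-token table."""
        tokens = re.findall(r"\S+|\s+", text)       # words and the whitespace between them
        lexicon = {}
        for tok in tokens:
            lexicon.setdefault(tok, len(lexicon))
        if len(lexicon) > 65536:
            raise ValueError("more than 65,536 distinct tokens; an escape scheme is needed")
        codes = b"".join(struct.pack("<H", lexicon[t]) for t in tokens)
        reverse = [None] * len(lexicon)
        for tok, code in lexicon.items():
            reverse[code] = tok
        return codes, reverse

    def decode_from(codes, reverse, even_offset, n_tokens):
        """Start decoding at any even byte offset; no earlier context is needed."""
        out = []
        for i in range(n_tokens):
            (code,) = struct.unpack_from("<H", codes, even_offset + 2 * i)
            out.append(reverse[code])
        return "".join(out)

    codes, reverse = compress("the quick brown fox jumps over the lazy dog " * 3)
    print(decode_from(codes, reverse, even_offset=10, n_tokens=6))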

叶落知秋 2024-07-18 18:00:12

Two possible solutions:

  1. Let the OS deal with compression, create and mount a compressed file system (SquashFS, clicfs, cloop, cramfs, e2compr or whatever) containing all your text files and don't do anything about compression in your application program.

  2. Use clicfs directly on each text file (one clicfs per text file) instead of compressing a filesystem image. Think of "mkclicfs mytextfile mycompressedfile" as "gzip <mytextfile >mycompressedfile", and of "clicfs mycompressedfile directory" as a way of getting random access to the data via the file "directory/mytextfile".

迷鸟归林 2024-07-18 18:00:12

I don't know if it's been mentioned yet, but the Kiwix project has done great work in this regard. Through their program Kiwix, they offer random access to ZIM file archives. Good compression, too. The project originated when there was a demand for offline copies of Wikipedia (which has grown to more than 100 GB in uncompressed form, with all media included). They have successfully taken a 25 GB file (a single-file embodiment of Wikipedia without most of the media) and compressed it to a measly 8 GB ZIM file archive. And through the Kiwix program, you can call up any page of Wikipedia, with all associated data, faster than you can surf the net.

Even though the Kiwix program is a technology built around the Wikipedia database structure, it proves that you can have excellent compression ratios and random access simultaneously.

随梦而飞# 2024-07-18 18:00:12

I'm not sure if this would be practical in your exact situation, but couldn't you just split each large file into smaller pieces, say 10 MB each, and gzip each piece? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Based on a given offset within the original large file, you would look in the file named "file" + (offset / 10485760) + ".gz". The offset within that uncompressed piece would be offset % 10485760.
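
A minimal sketch of that arithmetic in Python; the piece naming follows the answer, and gzip still decompresses from the start of the one piece it opens up to the requested offset:

    import gzip

    CHUNK = 10485760   # 10 MB of uncompressed data per piece

    def read_at(offset, size):
        """Read `size` uncompressed bytes starting at `offset` of the original file."""
        name = f"file{offset // CHUNK}.gz"    # which piece holds the offset
        with gzip.open(name, "rb") as f:
            f.seek(offset % CHUNK)            # offset within that piece
            return f.read(size)               # may come back short if it crosses a piece boundary

    # read_at(123456789, 80) touches only file11.gz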

我的奇迹 2024-07-18 18:00:12

I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.

Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either bzip2 or gzip algorithms. The offsets, metadata and compressed genomic data are concatenated into one file.

Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.

For your case, you could store (10 MB, or whatever) offsets in a header to a custom archive format. You parse the header, retrieve the offsets, and incrementally fseek through the file by current_offset_sum + header_size.
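
A rough Python sketch of such a custom format, with an arbitrary header layout and zlib standing in for the compressor: the header stores one compressed-chunk length per 10 MB of input, so the reader sums the lengths before the chunk it wants and seeks by current_offset_sum + header_size:

    import struct
    import zlib

    CHUNK = 10 * 1024 * 1024   # uncompressed bytes per chunk

    def write_archive(src_path, dst_path):
        """Header: chunk count, then one compressed length per chunk; body: the chunks."""
        compressed = []
        with open(src_path, "rb") as src:
            while True:
                chunk = src.read(CHUNK)
                if not chunk:
                    break
                compressed.append(zlib.compress(chunk))
        with open(dst_path, "wb") as dst:
            dst.write(struct.pack("<I", len(compressed)))
            for blob in compressed:
                dst.write(struct.pack("<Q", len(blob)))
            for blob in compressed:
                dst.write(blob)

    def read_chunk(dst_path, chunk_index):
        """Parse the header, sum the lengths before chunk_index, and seek to it."""
        with open(dst_path, "rb") as f:
            (count,) = struct.unpack("<I", f.read(4))
            lengths = struct.unpack(f"<{count}Q", f.read(8 * count))
            header_size = 4 + 8 * count
            f.seek(header_size + sum(lengths[:chunk_index]))   # current_offset_sum + header_size
            return zlib.decompress(f.read(lengths[chunk_index]))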
