Take a look at dictzip. It is compatible with gzip and allows coarse random access.
Its man page describes the format in more detail.
There is a dictzip package in Ubuntu; its source code is also in dictd-*.tar.gz. Its license is GPL, so you are free to study it.
Update:
I improved dictzip to have no file size limit.
My implementation is under MIT license.
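The underlying idea is easy to prototype. Below is a minimal Python sketch of the same scheme (independent gzip members per fixed-size chunk plus an offset table). It is not the actual dictzip on-disk format, which keeps its chunk table in the gzip header's extra field; the chunk size and function names are placeholders.

    import gzip

    CHUNK = 64 * 1024  # uncompressed bytes per chunk (a tuning knob)

    def compress_chunked(src_path, dst_path):
        """Write one independent gzip member per chunk and return the index:
        a list of (compressed_offset, compressed_length), one entry per chunk."""
        index, pos = [], 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(CHUNK):
                member = gzip.compress(chunk)
                index.append((pos, len(member)))
                dst.write(member)
                pos += len(member)
        return index  # persist this somewhere; dictzip keeps its table in the gzip header

    def read_at(dst_path, index, offset, size):
        """Coarse random access: decompress only the chunks covering [offset, offset+size)."""
        first, last = offset // CHUNK, (offset + size - 1) // CHUNK
        out = bytearray()
        with open(dst_path, "rb") as f:
            for comp_off, comp_len in index[first:last + 1]:
                f.seek(comp_off)
                out += gzip.decompress(f.read(comp_len))
        return bytes(out[offset - first * CHUNK:][:size])

Because every chunk is a complete gzip member, plain gzip -d still decompresses the whole file, which is the gzip compatibility mentioned above.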
我不知道有什么压缩文件格式可以支持随机访问未压缩数据中的特定位置(嗯,多媒体格式除外),但您可以自行编写。
例如,bzip2 压缩文件由大小 <1MB 未压缩的独立压缩块组成,这些块由魔术字节序列分隔,因此您可以解析 bzip2 文件,获取块边界,然后解压缩正确的块。 这需要一些索引来记住块从哪里开始。
不过,我认为最好的解决方案是将文件分割成您选择的块,然后使用一些存档器(例如 zip 或 rar)对其进行压缩,它们支持随机访问存档中的各个文件。
I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.
For example, bzip2 compressed files are composed of independent compressed blocks of size <1MB uncompressed, which are delimited by sequences of magic bytes, so you could parse the bzip2 file, get the block boundaries and then just uncompress the right block. This would need some indexing to remember where the blocks start.
Still, I think the best solution would be to split your file into chunks of your choice and then compress them with some archiver, like zip or rar, which supports random access to individual files in the archive.
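A rough sketch of that last suggestion in Python, using the standard zipfile module (the member names and the 1 MiB chunk size are arbitrary choices, not anything prescribed):

    import zipfile

    CHUNK = 1024 * 1024  # 1 MiB of uncompressed data per archive member

    def split_and_zip(src_path, zip_path):
        """Store the file as numbered members so each one can be read independently."""
        with open(src_path, "rb") as src, \
                zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            i = 0
            while chunk := src.read(CHUNK):
                zf.writestr("chunk%08d" % i, chunk)
                i += 1

    def read_at(zip_path, offset, size):
        """Decompress only the members that cover [offset, offset+size)."""
        first, last = offset // CHUNK, (offset + size - 1) // CHUNK
        out = bytearray()
        with zipfile.ZipFile(zip_path) as zf:
            for i in range(first, last + 1):
                out += zf.read("chunk%08d" % i)
        return bytes(out[offset - first * CHUNK:][:size])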
The .xz file format (which uses LZMA compression) seems to support this: its container can store the data in independently compressed blocks, together with an index that allows random-access reading.
This should be sufficient for your purpose. A drawback is that the API of liblzma (for interacting with these containers) does not seem that well-documented, so it may take some effort to figure out how to randomly access blocks.
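Python's lzma module does not expose liblzma's block-level API, so, as a rough approximation of the same layout, you can write one independent .xz stream per chunk and keep your own offset table (a sketch of the idea only; the chunk size and function names are arbitrary):

    import lzma

    CHUNK = 1024 * 1024  # uncompressed bytes per chunk (arbitrary)

    def compress_chunked(src_path, dst_path):
        """One complete .xz stream per chunk; the concatenation is still a valid
        multi-stream .xz file that xz -d will decompress as a whole."""
        index, pos = [], 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(CHUNK):
                blob = lzma.compress(chunk)  # FORMAT_XZ by default
                index.append((pos, len(blob)))
                dst.write(blob)
                pos += len(blob)
        return index

    def read_chunk(dst_path, index, n):
        """Random access to chunk n: seek to its stream and decompress only that."""
        comp_off, comp_len = index[n]
        with open(dst_path, "rb") as f:
            f.seek(comp_off)
            return lzma.decompress(f.read(comp_len))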
The gzip format can be randomly accessed provided an index has been previously created, as is demonstrated in zlib's zran.c source code.
I've developed a command line tool upon zlib's zran.c which creates indexes for gzip files: https://github.com/circulosmeos/gztool
It can even create an index for a still-growing gzip file (for example a log created by rsyslog directly in gzip format), thus in practice reducing index creation time to zero. See the -S (Supervise) option.
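zran.c relies on zlib features (bit-level repositioning and window priming) that most high-level bindings do not expose. If you control the writer, a simpler related trick is to emit a byte-aligned Z_FULL_FLUSH every so often and record those positions, since decompression can restart at a full-flush point with no history. A hedged Python sketch of that idea (it is not zran.c or gztool themselves; the span size is arbitrary):

    import zlib

    SPAN = 1024 * 1024  # uncompressed distance between access points (arbitrary)

    def gzip_with_access_points(src_path, dst_path):
        """Ordinary gzip output, but with Z_FULL_FLUSH every SPAN bytes.
        Returns the index: [(uncompressed_offset, compressed_offset), ...]."""
        comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 -> gzip wrapper
        points, upos, cpos = [], 0, 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(SPAN):
                out = comp.compress(chunk) + comp.flush(zlib.Z_FULL_FLUSH)
                dst.write(out)
                upos += len(chunk)
                cpos += len(out)
                points.append((upos, cpos))  # the next SPAN bytes start exactly here
            dst.write(comp.flush(zlib.Z_FINISH))
        return points

    def read_from(dst_path, points, offset):
        """Start decompressing at the nearest access point at or before offset."""
        prior = [(u, c) for u, c in points if u <= offset]
        with open(dst_path, "rb") as f:
            if prior:
                upos, cpos = prior[-1]
                f.seek(cpos)
                d = zlib.decompressobj(-15)  # raw deflate restarts cleanly after a full flush
            else:
                upos = 0
                d = zlib.decompressobj(31)   # from the very start, parse the gzip header too
            data = d.decompress(f.read())    # real code would read and emit incrementally
        return data[offset - upos:]

Indexing an arbitrary, already-existing gzip file, which is what zran.c and gztool do, additionally requires saving a 32 KiB window per access point and priming the inflater with it.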
bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.
See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.
I don't know to what extent it is adaptable to other applications.
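For illustration, here is a hedged Python sketch that walks the blocks of a BGZF file and builds a (compressed offset, uncompressed offset) index; it relies on the BGZF detail that every gzip member carries a 'BC' extra subfield holding the total block size minus one. Existing readers (for example Biopython's Bio.bgzf) already do this for you.

    import struct
    import zlib

    def bgzf_index(path):
        """Return [(compressed_offset, uncompressed_offset), ...], one entry per BGZF block."""
        index, coff, uoff = [], 0, 0
        with open(path, "rb") as f:
            while True:
                header = f.read(12)
                if len(header) < 12:
                    break
                magic, _cm, _flg, _mtime, _xfl, _os, xlen = struct.unpack("<HBBIBBH", header)
                if magic != 0x8B1F:
                    raise ValueError("not a gzip/BGZF member at offset %d" % coff)
                extra = f.read(xlen)
                bsize, pos = None, 0
                while pos + 4 <= len(extra):  # walk the extra subfields
                    si1, si2 = extra[pos], extra[pos + 1]
                    (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
                    if si1 == 66 and si2 == 67 and slen == 2:  # the 'BC' subfield
                        (raw,) = struct.unpack("<H", extra[pos + 4:pos + 6])
                        bsize = raw + 1  # total size of this gzip member
                    pos += 4 + slen
                if bsize is None:
                    raise ValueError("missing BGZF 'BC' subfield")
                index.append((coff, uoff))
                f.seek(coff)
                member = f.read(bsize)  # one whole gzip member
                uoff += len(zlib.decompress(member, 31))  # wbits=31 -> gzip wrapper
                coff += bsize
        return index

With the index in hand, reading at an uncompressed offset means decompressing only the member that contains it.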
Because lossless compression works better on some areas than others,
if you store compressed data into blocks of convenient length BLOCKSIZE, even though each block has exactly the same number of compressed bytes, some compressed blocks will expand to a much longer piece of plaintext than others.
You might look at "Compression: A Key for Next-Generation Text Retrieval Systems" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates, in Computer magazine, November 2000: http://doi.ieeecomputersociety.org/10.1109/2.881693
Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a vocabulary list) into a whole word.
One can directly search the compressed text for words or phrases,
which turns out to be even faster than searching uncompressed text.
Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.
You can give every word a unique 2 byte code, since you probably have less than 65,000 unique words in your text.
(There are almost 13,000 unique words in the KJV Bible).
Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases".
(The compression gained by packing frequent words and phrases into two bytes
is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter).
There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression.
For example, you could tweak an LZW compressor to dump "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data.
Or you could arbitrarily chop up your uncompressed data into 5-byte phrases in a lexicon file, one line per phrase.
Or you could chop up your uncompressed data into actual English words, and put each word -- including the space at the beginning of the word -- into the lexicon file.
Then use "sort --unique" to eliminate duplicate words in that lexicon file.
(Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)
Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two byte "words" -- from there to the end of the file.
Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two byte code" to "variable-length phrase".
My first draft would start with a simple one line per phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.
You can pick any random even byte offset into the compressed text, and start decompressing from there.
I don't think it's possible to make a finer-grained random access compressed file format.
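A toy version of the two-byte word coding in Python, assuming for brevity that the vocabulary fits in 65,536 entries, so the byte-escape codes described above are left out:

    import re
    import struct

    def build_lexicon(text):
        """Tokens keep their leading whitespace, so decoding is a plain join."""
        words = re.findall(r"\s*\S+", text)
        lexicon = sorted(set(words))
        assert len(lexicon) <= 65536, "would need the byte-escape trick described above"
        return lexicon, {w: i for i, w in enumerate(lexicon)}

    def encode(text, codes):
        """The compressed text is simply a sequence of 2-byte codes."""
        return b"".join(struct.pack("<H", codes[w]) for w in re.findall(r"\s*\S+", text))

    def decode_from(blob, lexicon, word_index, n_words):
        """Random access: start decoding at any word boundary (an even byte offset)."""
        start = 2 * word_index
        return "".join(lexicon[code] for (code,) in
                       struct.iter_unpack("<H", blob[start:start + 2 * n_words]))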
Two possible solutions:
Let the OS deal with compression, create and mount a compressed file system (SquashFS, clicfs, cloop, cramfs, e2compr or whatever) containing all your text files and don't do anything about compression in your application program.
Use clicfs directly on each text file (one clicfs per text file) instead of compressing a filesystem image. Think of "mkclicfs mytextfile mycompressedfile" as being "gzip <mytextfile >mycompressedfile", and of "clicfs mycompressedfile directory" as a way of getting random access to the data via the file "directory/mytextfile".
I don't know if it's been mentioned yet, but the Kiwix project has done great work in this regard. Through their program Kiwix, they offer random access to ZIM file archives. Good compression, too. The project originated when there was a demand for offline copies of Wikipedia (which has reached more than 100 GB in uncompressed form, with all media included). They have successfully taken a 25 GB file (a single-file embodiment of Wikipedia without most of the media) and compressed it to a measly 8 GB ZIM file archive. And through the Kiwix program, you can call up any page of Wikipedia, with all associated data, faster than you can by surfing the net.
Even though the Kiwix program is a technology based around the Wikipedia database structure, it proves that you can have excellent compression ratios and random access simultaneously.
I'm not sure if this would be practical in your exact situation, but couldn't you just gzip each large file into smaller files, say 10 MB each? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Based on a given offset within the original large file, you could search in the file named "file" + (offset / 10485760) + ".gz". The offset within the uncompressed archive would be offset % 10485760.
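The lookup described above, as a small Python sketch (the prefix "file" and the 10 MB piece size mirror the example; seeking inside a piece still decompresses up to that point, but never more than one piece's worth):

    import gzip

    PIECE = 10 * 1024 * 1024  # 10 MB of uncompressed data per piece

    def read_at(prefix, offset, size):
        """Read `size` bytes starting at `offset` of the original file, which was
        stored as prefix0.gz, prefix1.gz, ... with PIECE uncompressed bytes each."""
        out = bytearray()
        while size > 0:
            name = "%s%d.gz" % (prefix, offset // PIECE)   # e.g. "file3.gz"
            with gzip.open(name, "rb") as f:
                f.seek(offset % PIECE)                     # skip within this piece only
                chunk = f.read(min(size, PIECE - offset % PIECE))
            if not chunk:
                break  # ran past the end of the data
            out += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)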
I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.
Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either bzip2 or gzip algorithms. The offsets, metadata and compressed genomic data are concatenated into one file.
Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.
For your case, you could store (10 MB, or whatever) offsets in a header to a custom archive format. You parse the header, retrieve the offsets, and incrementally fseek through the file by current_offset_sum + header_size.
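A minimal sketch of that last suggestion in Python, with a made-up layout (a chunk count, then one compressed size per chunk, then the concatenated bzip2-compressed chunks); the fixed-width header is what makes the offset arithmetic trivial:

    import bz2
    import struct

    def write_archive(chunks, path):
        """`chunks` is a list of uncompressed byte strings of roughly equal size."""
        blobs = [bz2.compress(c) for c in chunks]
        with open(path, "wb") as f:
            f.write(struct.pack("<I", len(blobs)))
            for b in blobs:
                f.write(struct.pack("<Q", len(b)))  # the header: one size per chunk
            for b in blobs:
                f.write(b)

    def read_chunk(path, n):
        """Parse the header, then fseek by current_offset_sum + header_size."""
        with open(path, "rb") as f:
            (count,) = struct.unpack("<I", f.read(4))
            sizes = struct.unpack("<%dQ" % count, f.read(8 * count))
            header_size = 4 + 8 * count
            f.seek(header_size + sum(sizes[:n]))
            return bz2.decompress(f.read(sizes[n]))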