Random seek in a 7z single-file archive

Posted 2024-12-11 19:53:52

Is it possible to do random access (lots of seeks) into a very large file compressed with 7zip?

The original file is very large (999 GB of XML) and I can't store it in unpacked form (I don't have that much free space). So, if the 7z format allows access to a block in the middle of the archive without decompressing all blocks before it, I can build an index of block start positions and the corresponding offsets in the original file.

The header of my 7z archive is:

37 7A BC AF 27 1C 00 02 28 99 F1 9D 4A 46 D7 EA  // signature; version 0.2; start header CRC; next header offset...
00 00 00 00 44 00 00 00 00 00 00 00 F4 56 CF 92  // ...offset (cont.); next header size = 0x44; next header CRC
00 1E 1B 48 A6 5B 0A 5A 5D DF 57 D8 58 1E E1 5F
71 BB C0 2D BD BF 5A 7C A2 B1 C7 AA B8 D0 F5 26
FD 09 33 6C 05 1E DF 71 C6 C5 BD C0 04 3A B6 29

UPDATE: the 7z archiver says that this file has a single block of data, compressed with the LZMA algorithm. Decompression speed in testing is 600 MB/s (of unpacked data), using only one CPU core.
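
For reference, the 32 bytes above are the fixed-size 7z signature header. Below is a minimal Python sketch (assuming only the layout documented in 7zFormat.txt, not any 7z library) that decodes it; note that the next-header offset and size it yields only locate the archive metadata near the end of the file and do not make the single LZMA block seekable:

import struct

def parse_7z_signature_header(path):
    # Layout per 7zFormat.txt: 6-byte signature, 2-byte version,
    # 4-byte StartHeaderCRC, then NextHeaderOffset (8 bytes),
    # NextHeaderSize (8 bytes), NextHeaderCRC (4 bytes), little-endian.
    with open(path, "rb") as f:
        raw = f.read(32)
    sig, major, minor, start_crc, nh_off, nh_size, nh_crc = \
        struct.unpack("<6sBBIQQI", raw)
    if sig != b"7z\xbc\xaf\x27\x1c":
        raise ValueError("not a 7z archive")
    return {
        "version": (major, minor),      # 0.2 for the dump above
        "start_header_crc": start_crc,
        "next_header_offset": nh_off,   # relative to byte 32
        "next_header_size": nh_size,    # 0x44 bytes for the dump above
        "next_header_crc": nh_crc,
    }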

Comments (4)

吐个泡泡 2024-12-18 19:53:52

It's technically possible, but if your question is "does the currently available 7zip command-line tool allow that?", the answer is unfortunately no.
The best it offers is compressing each file in the archive independently, which allows individual files to be retrieved directly.
But since what you want to compress is a single (huge) file, this trick will not work.

I'm afraid the only way is to split your file into small blocks and feed them to an LZMA encoder (included in the LZMA SDK). Unfortunately, that requires some programming skills.

Note: a technically inferior but trivial compression scheme can be found here.
The main program does just what you are looking for: it cuts the source file into small blocks and feeds them one by one to a compressor (in this case, LZ4). The decoder then does the reverse operation, and it can easily skip all the compressed blocks and go straight to the one you want to retrieve.
http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c
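
As a rough illustration of that approach (not the LZMA SDK itself), here is a sketch in Python using the standard lzma module: each block is compressed as an independent stream, and a small side index (whose layout is made up here) records where every compressed block starts, so a single block can be decoded later without touching the ones before it:

import lzma
import struct

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB of uncompressed data per block (tunable)

def compress_chunked(src_path, dst_path, idx_path):
    # Write independently decodable LZMA/xz streams plus an index of
    # (compressed offset, compressed size, uncompressed offset) triples.
    with open(src_path, "rb") as src, \
         open(dst_path, "wb") as dst, \
         open(idx_path, "wb") as idx:
        uncompressed_off = 0
        while True:
            chunk = src.read(BLOCK_SIZE)
            if not chunk:
                break
            comp = lzma.compress(chunk)   # standalone stream per block
            idx.write(struct.pack("<QQQ", dst.tell(), len(comp), uncompressed_off))
            dst.write(comp)
            uncompressed_off += len(chunk)

def read_block(dst_path, idx_path, block_no):
    # Decompress one block without decompressing anything before it.
    with open(idx_path, "rb") as idx:
        idx.seek(block_no * 24)
        comp_off, comp_len, _ = struct.unpack("<QQQ", idx.read(24))
    with open(dst_path, "rb") as dst:
        dst.seek(comp_off)
        return lzma.decompress(dst.read(comp_len))

Compressing small blocks independently costs some compression ratio compared with one big LZMA stream, so the block size is a trade-off between ratio and seek granularity.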

雪若未夕 2024-12-18 19:53:52

How about this:

Concept: because you are basically reading only one file, index the .7z by block.

Read the compressed file block by block, give each block a number and possibly an offset into the large file. Scan the data stream for target item anchors (e.g. Wikipedia article titles). For each anchor, record the block number where the item begins (it may have started in the previous block).

Write the index to some kind of O(log n) store. For an access, retrieve the block number and its offset, extract the block, and find the item. The cost is bounded by the extraction of one block (or very few) plus the string search within that block.

For this you have to read through the file once, but you can stream it and discard the data after processing, so nothing has to hit the disk.

DARN: you basically postulated this in your question... it seems advantageous to read the question before answering...
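
A rough sketch of the indexing idea, assuming block-by-block access to the decompressed data already exists somehow (the anchor pattern and data layout are made up; a real indexer would also need to handle anchors that straddle a block boundary, as noted above):

import bisect
import re

ANCHOR = re.compile(rb"<title>(.*?)</title>")  # example anchor for a wiki XML dump

def build_anchor_index(blocks):
    # blocks: iterable of (block_no, uncompressed_bytes).
    # Returns a list of (anchor, block_no) sorted by anchor,
    # so lookups are O(log n) with bisect.
    index = []
    for block_no, data in blocks:
        for m in ANCHOR.finditer(data):
            index.append((m.group(1), block_no))
    index.sort()
    return index

def lookup(index, anchor):
    # Return the number of the block in which `anchor` was found, or None.
    i = bisect.bisect_left(index, (anchor,))
    if i < len(index) and index[i][0] == anchor:
        return index[i][1]
    return None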

一曲琵琶半遮面シ 2024-12-18 19:53:52

The 7z archiver says that this file has a single block of data, compressed with the LZMA algorithm.

What was the 7z / xz command you used to find out whether it is a single compressed block or not? Will 7z create a multiblock (multistream) archive when used with several threads?

The original file is very large (999 GB of XML)

The good news: Wikipedia switched to multistream archives for its dumps (at least for enwiki): http://dumps.wikimedia.org/enwiki/

For example, the most recent dump, http://dumps.wikimedia.org/enwiki/20140502/, has a multistream bzip2 dump (with a separate index "offset:export_article_id:article_name"), and the 7z dump is split into many sub-GB archives with ~3k (?) articles per archive:

Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream

enwiki-20140502-pages-articles-multistream.xml.bz2 10.8 GB
enwiki-20140502-pages-articles-multistream-index.txt.bz2 150.3 MB

All pages with complete edit history (.7z)

enwiki-20140502-pages-meta-history1.xml-p000000010p000003263.7z 213.3 MB
enwiki-20140502-pages-meta-history1.xml-p000003264p000005405.7z 194.5 MB
enwiki-20140502-pages-meta-history1.xml-p000005406p000008209.7z 216.1 MB
enwiki-20140502-pages-meta-history1.xml-p000008210p000010000.7z 158.3 MB
enwiki-20140502-pages-meta-history2.xml-p000010001p000012717.7z 211.7 MB
 .....
enwiki-20140502-pages-meta-history27.xml-p041211418p042648840.7z 808.6 MB

I think we can use the bzip2 index to estimate the article id even for the 7z dumps, and then we just need the 7z archive covering the right range (...p<first_id>p<last_id>.7z). stub-meta-history.xml may help too.

FAQ for dumps:
http://meta.wikimedia.org/wiki/Data_dumps/FAQ
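
As a sketch of how the multistream dump and its index can be used for random access: each line of the (decompressed) index file is "offset:export_article_id:article_name", where the offset is the byte position inside the .bz2 file of the stream containing that article, so one stream (about 100 pages) can be decompressed in isolation with Python's bz2 module. File names below are examples only:

import bz2

def read_stream_at(dump_path, offset, max_bytes=64 * 1024 * 1024):
    # Decompress the single bz2 stream that starts at `offset` and
    # return its uncompressed bytes (a safety cap guards against
    # pointing at the wrong offset).
    dec = bz2.BZ2Decompressor()
    out = []
    total = 0
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not dec.eof:
            chunk = f.read(1024 * 1024)
            if not chunk:
                break
            data = dec.decompress(chunk)
            out.append(data)
            total += len(data)
            if total > max_bytes:
                break
    return b"".join(out)

# Usage: look up the article title in the index, take its offset column, then:
# xml = read_stream_at("enwiki-20140502-pages-articles-multistream.xml.bz2", some_offset)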

吝吻 2024-12-18 19:53:52

Just use:

7z e myfile_xml.7z -so | sed [something] 

Example: get line 7:

7z e myfile_xml.7z -so | sed -n 7p
