Searching in a compressed, sorted fixed-width file



Assume I have a regular fixed width file that is sorted on one of the fields. Given that I know the length of the records, I can use lseek to implement a binary search to find records with fields that match a given value without having to read the entire file.

Now the difficulty is that the file is gzipped. Is it possible to do this without completely inflating the file? If not with gzip, is there any compression format that supports this kind of behavior?
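
For concreteness, here is a minimal sketch of the uncompressed case the question describes, assuming a fixed record length and a key occupying a fixed slice of each record (RECORD_LEN, KEY_SLICE and the function name are illustrative; key is a bytes value of the same width as the key field):

    import os

    RECORD_LEN = 32          # assumed fixed record length, newline included
    KEY_SLICE = slice(0, 8)  # assumed position of the sort key in each record

    def find_record(path, key):
        """Binary search a sorted fixed-width file without reading all of it."""
        fd = os.open(path, os.O_RDONLY)
        try:
            nrecords = os.fstat(fd).st_size // RECORD_LEN
            lo, hi = 0, nrecords
            while lo < hi:
                mid = (lo + hi) // 2
                rec = os.pread(fd, RECORD_LEN, mid * RECORD_LEN)  # seek + read one record
                if rec[KEY_SLICE] < key:
                    lo = mid + 1
                else:
                    hi = mid
            if lo < nrecords:
                rec = os.pread(fd, RECORD_LEN, lo * RECORD_LEN)
                if rec[KEY_SLICE] == key:
                    return rec
            return None
        finally:
            os.close(fd)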


Comments (6)

缺⑴份安定 2024-09-05 17:10:25


The bzip2 file format consists of multiple independently-compressed blocks.
If you're willing to maintain an index alongside your bzip2 file, you could
know where to lseek to.

Note: This is a duplicate of the questions:

These answer the same question, but also identify BGZF as a gzip-compatible output format with sync points inserted to reset the compression state.
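
A minimal sketch of the index-plus-BGZF idea, assuming Biopython's Bio.bgzf module is available; RECORD_LEN, KEY_SLICE and the helper names are illustrative, not part of any standard API. BGZF handles accept the virtual offsets returned by tell(), so a side index of (key, virtual offset) pairs built at write time lets a reader jump straight to the right block:

    import bisect
    from Bio import bgzf                    # pip install biopython

    RECORD_LEN = 32                         # assumed fixed record length
    KEY_SLICE = slice(0, 8)                 # assumed key position within a record

    def compress_with_index(records, path):
        """Write sorted fixed-width records as BGZF; return [(key, virtual offset)]."""
        index = []
        out = bgzf.BgzfWriter(path, "wb")
        for rec in sorted(records):
            index.append((rec[KEY_SLICE], out.tell()))   # offset where rec will land
            out.write(rec)
        out.close()
        return index

    def lookup(path, index, key):
        keys = [k for k, _ in index]
        i = bisect.bisect_left(keys, key)
        if i == len(keys) or keys[i] != key:
            return None
        fh = bgzf.BgzfReader(path, "rb")
        try:
            fh.seek(index[i][1])            # decompresses only the needed block
            return fh.read(RECORD_LEN)
        finally:
            fh.close()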

兰花执着 2024-09-05 17:10:25


This is totally impossible with a file compressed with zip and derivatives. Those are based on a rolling dictionary window, typically with some sort of buffer-based compression of the most significant bits of the output codes on top of that. Bottom line is that a particular sequence of bytes in a zip file is meaningless without context.

If you want to be able to randomly read a particular record out of a compressed file, you have to compress each record independently and then have an index into the file. Depending on your data, this would probably render the compression step worthless.
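
A minimal sketch of that scheme, using only the standard library; the names are illustrative. Each record is deflated on its own and appended to one file, while a side index remembers where each compressed record starts. This also makes the caveat visible: very small records barely compress at all.

    import os
    import zlib

    KEY_SLICE = slice(0, 8)          # assumed key position within a record

    def write_archive(records, path):
        """Compress each record independently; return {key: (offset, length)}."""
        index = {}
        with open(path, "wb") as out:
            for rec in sorted(records):
                blob = zlib.compress(rec)
                index[rec[KEY_SLICE]] = (out.tell(), len(blob))
                out.write(blob)
        return index

    def read_record(path, index, key):
        """Inflate only the one record the index points at."""
        if key not in index:
            return None
        offset, length = index[key]
        fd = os.open(path, os.O_RDONLY)
        try:
            return zlib.decompress(os.pread(fd, length, offset))
        finally:
            os.close(fd)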

余罪 2024-09-05 17:10:25


Pretty much all compression algorithms I know of work in block mode, meaning a random seek isn't possible. Even LZMA, which doesn't use an initial dictionary, requires sequential decompression.

Stream compression usually means adaptive lossy compression with some keys that reset the state (or actually cut the stream into blocks). The details are more complex.

Now here are a couple of ideas to solve this:

  • Create an index: like when you open a ZIP, you can see all the files in it
  • Cut your compressed file into blocks and then use a binary search within each block (actually similar to the first one)
  • Decompress in memory but discard any data until you find the beginning of the data you're looking for (sketched below)

The last way is good for small compressed files, and the block method is good for larger compressed files. You can mix the two.

PS: Fixed width in the input doesn't mean the compressed file will be fixed width, so that's pretty useless information.
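
A minimal sketch of the last idea, again with illustrative RECORD_LEN / KEY_SLICE values: stream-decompress the gzip file one record at a time and discard everything before the key, stopping early because the data is sorted. Nothing is inflated to disk and memory stays at one record, but the scan is still linear.

    import gzip

    RECORD_LEN = 32
    KEY_SLICE = slice(0, 8)

    def scan_for(path, key):
        with gzip.open(path, "rb") as fh:
            while True:
                rec = fh.read(RECORD_LEN)
                if len(rec) < RECORD_LEN:
                    return None          # reached EOF without finding the key
                k = rec[KEY_SLICE]
                if k == key:
                    return rec
                if k > key:
                    return None          # sorted input, so we can stop early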

与酒说心事 2024-09-05 17:10:25


Building on what Wernight said, you can split your file into many fixed size subfiles before you gzip it. Your binary search can start by searching for the subfile that contains the range, then it will only need to decompress the small subfile rather than the whole thing. You can optimize by creating an upper-level file in the archive that contains the first row of each subfile.
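
A minimal sketch of that layout, with illustrative names and sizes: the sorted records are split into gzipped subfiles of a fixed number of records, and a small uncompressed index file stores the first key of every subfile. A lookup bisects the index, then inflates only one subfile.

    import bisect
    import gzip
    import os

    RECORD_LEN = 32
    KEY_SLICE = slice(0, 8)
    RECORDS_PER_SUBFILE = 10_000   # assumed chunk size

    def split_and_compress(records, outdir):
        """Write part00000.gz, part00001.gz, ... plus an index of first keys."""
        os.makedirs(outdir, exist_ok=True)
        first_keys = []
        records = sorted(records)
        for n, start in enumerate(range(0, len(records), RECORDS_PER_SUBFILE)):
            chunk = records[start:start + RECORDS_PER_SUBFILE]
            first_keys.append(chunk[0][KEY_SLICE])
            with gzip.open(os.path.join(outdir, f"part{n:05d}.gz"), "wb") as out:
                out.write(b"".join(chunk))
        with open(os.path.join(outdir, "index"), "wb") as out:
            out.write(b"".join(first_keys))   # fixed-width keys, offsets are implicit
        return first_keys

    def lookup(outdir, first_keys, key):
        n = bisect.bisect_right(first_keys, key) - 1
        if n < 0:
            return None
        with gzip.open(os.path.join(outdir, f"part{n:05d}.gz"), "rb") as fh:
            data = fh.read()                  # inflate just this one subfile
        for off in range(0, len(data), RECORD_LEN):
            rec = data[off:off + RECORD_LEN]
            if rec[KEY_SLICE] == key:
                return rec
        return None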

留蓝 2024-09-05 17:10:25


Continuing on what Liudvikas Bukys says: if your compressed blocks have a unique header, you don't need the index. That's similar to how seeking is done in some compressed video formats. You seek to a point and look for the next header. This does need robust validation (using a checksum) though, as mis-identification is possible.
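
A minimal sketch of that scan, over a made-up block layout (not any standard format): each block starts with a magic marker, a 4-byte payload length and a CRC32 of the payload, so a reader can land anywhere, hunt for the next marker, and use the checksum to reject markers that merely happen to occur inside compressed data.

    import struct
    import zlib

    MAGIC = b"\xfeBLK"   # made-up marker for illustration

    def next_block(buf, start):
        """Return (payload_start, payload) of the first valid block at or after start."""
        pos = buf.find(MAGIC, start)
        while pos != -1:
            header_end = pos + len(MAGIC) + 8
            if header_end <= len(buf):
                length, crc = struct.unpack(">II", buf[pos + len(MAGIC):header_end])
                payload = buf[header_end:header_end + length]
                if len(payload) == length and zlib.crc32(payload) == crc:
                    return header_end, payload    # checksum confirms a real block
            pos = buf.find(MAGIC, pos + 1)        # false positive: keep scanning
        return None, None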

轻拂→两袖风尘 2024-09-05 17:10:25


What you want is seekable compression. The dict server has dictzip, which is format-compatible with gzip because it stores its seektable in a gzip extension field in the header, and The Sleuth Kit has sgzip, which isn't, because it stores block lengths at the beginning of each of the blocks.
