Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.

I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.

What I am looking for is to compress it with a good compression algorithm (like, say, GZip), with markers that would enable very occasional random access. Random access as in "read the byte at location [64-bit address] in the original (uncompressed) stream". This is a little different from the classic deflator libraries like ZLIB, which let you decompress the stream continuously. What I would like is random access with a latency of, say, at most 1 MB of decompression work per byte read.

Of course, I hope to use an existing library rather than reinvent the NIH wheel.

衣神在巴黎 2024-09-14 19:57:44

Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on the block size.

梦纸 2024-09-14 19:57:44

I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.

The Boost library has been accepted by the C++ standards committee as part of TR2, so it will eventually be built into most compilers (under std::tr2::sys). It is also cross-platform compatible.

Boost Releases

Boost Getting Started Guide. NOTE: only some parts of boost::iostreams are header-only and require no separately-compiled library binaries or special treatment when linking.
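
For illustration, here is a minimal sketch of reading one byte at a given uncompressed offset through Boost.Iostreams (assuming Boost and zlib are installed and you link against boost_iostreams; the file name and offset are placeholders). Note that gzip_decompressor is sequential, so the "seek" below is just a forward skip from the start of the stream; to stay within a bound like "1 MB of decompression work per read" you would still have to add your own restart markers on top of this.

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <fstream>
#include <iostream>

int main() {
    std::ifstream raw("matrix.gz", std::ios::binary);   // placeholder file name
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());     // decompress on the fly
    in.push(raw);

    const std::streamoff target = 123456789;            // offset in the uncompressed stream
    in.ignore(target);                                   // sequential skip: O(target) decompression work

    char byte;
    if (in.get(byte))
        std::cout << "byte at " << target << " = "
                  << static_cast<int>(static_cast<unsigned char>(byte)) << "\n";
}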

慕巷 2024-09-14 19:57:44

  1. Sort the big file first.
  2. Divide it into chunks of your desired size (1 MB) with some sequence in the name (File_01, File_02, .., File_NN).
  3. Take the first ID from each chunk, plus the filename, and put both into another file.
  4. Compress the chunks.
  5. You will be able to search the ID file using whatever method you wish, maybe a binary search, and open each chunk file as you need it (a small sketch of this lookup follows below).

If you need deeper indexing, you could use a B-tree algorithm where the "pages" are the files.
Several implementations of this exist on the web, since the code is a little tricky.
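
Roughly, the lookup in step 5 could look like the sketch below (the IndexEntry struct, the uint64_t key type and the file naming are illustrative assumptions, not something prescribed above); once you have the chunk name, decompress just that one chunk and scan it for the record.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

struct IndexEntry {
    std::uint64_t first_id;   // smallest ID stored in this chunk (from step 3)
    std::string   filename;   // e.g. "File_07"
};

// `index` must be sorted by first_id, which it is if the big file was sorted first.
// Returns the name of the (compressed) chunk that may contain `id`.
std::string chunk_for(const std::vector<IndexEntry>& index, std::uint64_t id) {
    auto it = std::upper_bound(index.begin(), index.end(), id,
        [](std::uint64_t v, const IndexEntry& e) { return v < e.first_id; });
    if (it == index.begin()) return {};   // id is smaller than anything stored
    return std::prev(it)->filename;
}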

浪漫人生路 2024-09-14 19:57:44

You could use bzip2 and make your own API pretty easily, based on James Taylor's seek-bzip2.

静若繁花 2024-09-14 19:57:43

If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.

滥情空心 2024-09-14 19:57:43

Byte Pair Encoding allows random access to data.

You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, so you can access it.

However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.

One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.

If it's accessed rarely, you could compress the index with gzip and decode it when needed.
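
To make the "decode a block on its own" point concrete, here is a rough sketch of the decoding side of byte pair encoding (the table layout is an assumption, not any particular library's format): a block can be expanded given only its own bytes plus the small substitution table, which is what makes a per-1 MB index sufficient for random access.

#include <cstdint>
#include <map>
#include <string>
#include <utility>

// A code byte that never occurs in the raw data maps to the pair of bytes it
// replaced. Expansion is recursive because replacements can be nested.
using PairTable = std::map<std::uint8_t, std::pair<std::uint8_t, std::uint8_t>>;

static void expand(std::uint8_t b, const PairTable& table, std::string& out) {
    auto it = table.find(b);
    if (it == table.end()) {
        out.push_back(static_cast<char>(b));    // literal byte
    } else {
        expand(it->second.first, table, out);   // substituted pair: expand both halves
        expand(it->second.second, table, out);
    }
}

// Decode one independently compressed block back to its original bytes.
std::string decode_block(const std::string& block, const PairTable& table) {
    std::string out;
    for (unsigned char b : block) expand(b, table, out);
    return out;
}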

我做我的改变 2024-09-14 19:57:43

If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).

Edit: Yes, there is existing code to handle this. Recent versions of Info-zip's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file and the name of one of the files in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from zip3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).

Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).
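
A sketch along the lines of that last "run unzip in the background" option, using POSIX popen and unzip's -p flag (extract an archive member to stdout). The archive name, the 1 MB chunk size and the chunk_000000-style member names are assumptions, and a read that crosses a chunk boundary is not handled here.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Read `count` bytes starting at uncompressed offset `offset`, assuming the data
// was split into 1 MB pieces stored as members chunk_000000, chunk_000001, ... of data.zip.
std::vector<char> read_at(std::uint64_t offset, std::size_t count) {
    const std::uint64_t kChunk = 1 << 20;
    std::uint64_t chunk_no = offset / kChunk;   // which archive member to extract
    std::uint64_t skip     = offset % kChunk;   // offset inside that member

    char cmd[64];
    std::snprintf(cmd, sizeof cmd, "unzip -p data.zip chunk_%06llu",
                  static_cast<unsigned long long>(chunk_no));

    FILE* p = popen(cmd, "r");
    if (!p) return {};

    char scratch[4096];                         // a pipe cannot seek, so discard `skip` bytes
    while (skip > 0) {
        std::size_t want = skip < sizeof scratch ? static_cast<std::size_t>(skip) : sizeof scratch;
        std::size_t n = std::fread(scratch, 1, want, p);
        if (n == 0) break;
        skip -= n;
    }

    std::vector<char> buf(count);
    buf.resize(std::fread(buf.data(), 1, count, p));
    pclose(p);
    return buf;
}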

手心的海 2024-09-14 19:57:43

Take a look at my project - csio. I think it is exactly what you are looking for: a stdio-like interface with a multithreaded compressor included.

It is a library, written in C, which provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (uncompressed) files and with files compressed with the help of the dzip utility. This utility is included in the project and written in C++. It produces a valid gzip archive, which can be handled by the standard utilities as well as by csio. dzip can compress in many threads (see the -j option), so it can compress very big files very quickly.

Typical usage:

dzip -j4 myfile

...

CFILE file = cfopen("myfile.dz", "r");   /* open the dzip-compressed file */
off_t some_offset = 673820;              /* position in the uncompressed stream */
cfseek(file, some_offset);
char buf[100];
cfread(buf, 100, 1, file);               /* read 100 bytes starting at that position */
cfclose(file);

It is MIT licensed, so you can use it in your projects without restrictions. For more information, visit the project page on GitHub: https://github.com/hoxnox/csio
