允许在文件中随机读/写的最佳压缩算法是什么?

发布于 2024-07-07 17:02:50 字数 1560 浏览 13 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

我只土不豪 2024-07-14 17:02:51

令我惊讶的是,有如此多的回复暗示这种事情是不可能的。

这些人难道没听说过“压缩文件系统”吗?
自 1993 年 Microsoft 因压缩文件系统技术被 Stac Electronics 起诉以来,哪些技术就已经存在了?

我听说 LZSLZJB 是人们实现压缩文件系统的流行算法,压缩文件系统必然需要随机访问读取和随机访问写入。

也许最简单和最好的做法是为该文件打开文件系统压缩,并让操作系统处理细节。
但如果您坚持手动处理它,也许您可​​以通过阅读 NTFS 透明文件压缩

另请查看:
"StackOverflow:对随机访问提供良好支持的压缩格式档案?”

I am stunned at the number of responses that imply that such a thing is impossible.

Have these people never heard of "compressed file systems",
which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?

I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.

Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details.
But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.

Also check out:
"StackOverflow: Compression formats with good support for random access within archives?"

基于字典的压缩方案,每个字典条目的代码都以相同的大小进行编码,将导致能够以代码大小的任意倍数开始读取,并且如果代码不使用其上下文,则写入和更新很容易/邻居。

如果编码包含区分代码开始或结束的方法,那么您不需要代码具有相同的长度,并且可以从文件中间的任何位置开始读取。 如果您从流中的未知位置读取,则此技术更有用。

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.

If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.

烛影斜 2024-07-14 17:02:51

我认为斯蒂芬·丹尼可能在这里有所发现。 想象一下:

  • 类似 zip 的序列压缩来编码
  • 字典映射代码 -> 序列
  • 文件就像一个文件系统
    • 每次写入都会生成一个新的“文件”(字节序列,根据字典压缩)
    • “文件系统”跟踪哪个“文件”属于哪些字节(开始、结束)
    • 每个“文件”都根据字典进行压缩
    • 按文件读取工作,根据“文件系统”解压缩和检索字节
    • 写入使“文件”无效,附加新的“文件”来替换无效的文件
  • 该系统将需要的无效文件:
    • 文件系统的碎片整理机制
    • 不时压缩字典(删除未使用的代码)
  • 正确完成,可以在无人查看时(空闲时间)或通过创建新文件并最终“切换”来完成内务管理,

一个积极的效果是字典将应用于整个文件。 如果您可以节省 CPU 周期,您可以定期检查与“文件”边界重叠的序列,然后将它们重新分组。

这个想法是为了真正的随机读取。 如果您只打算读取固定大小的记录,那么这个想法的某些部分可能会变得更容易。

I think Stephen Denne might be onto something here. Imagine:

  • zip-like compression of sequences to codes
  • a dictionary mapping code -> sequence
  • file will be like a filesystem
    • each write generates a new "file" (a sequence of bytes, compressed according to dictionary)
    • "filesystem" keeps track of which "file" belongs to which bytes (start, end)
    • each "file" is compressed according to dictionary
    • reads work filewise, uncompressing and retrieving bytes according to "filesystem"
    • writes make "files" invalid, new "files" are appended to replace the invalidated ones
  • this system will need:
    • defragmentation mechanism of filesystem
    • compacting dictionary from time to time (removing unused codes)
  • done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually

One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regrouping them.

This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.

韬韬不绝 2024-07-14 17:02:51

我不知道有什么压缩算法允许随机读取,更不用说随机写入了。 如果您需要这种能力,最好的选择是将文件分成块而不是整个压缩。

例如
我们首先看看只读情况。 假设您将文件分成 8K 块。 您压缩每个块并按顺序存储每个压缩块。 您需要记录每个压缩块的存储位置及其大小。 然后,假设您需要从偏移量 O 开始读取 N 个字节。您需要找出它位于哪个块中 (O / 8K),解压缩该块并获取这些字节。 您需要的数据可能跨越多个块,因此您必须处理这种情况。

当您希望能够写入压缩文件时,事情会变得复杂。 您必须处理越来越大和越来越小的压缩块。 您可能需要为每个块添加一些额外的填充,以防它扩展(未压缩时它的大小仍然相同,但不同的数据将压缩为不同的大小)。 如果压缩数据太大而无法放回给定的原始空间,您甚至可能需要移动块。

这基本上就是压缩文件系统的工作原理。 您可能最好为文件打开文件系统压缩并正常读取/写入它们。

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.

e.g.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.

Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.

This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

同展鸳鸯锦 2024-07-14 17:02:51

压缩就是消除数据中的冗余。 不幸的是,冗余不太可能以单调均匀的方式分布在整个文件中,这大约是您可以期望压缩和细粒度随机访问的唯一场景。

但是,您可以通过维护在压缩期间构建的外部列表来接近随机访问,该列表显示未压缩数据流中选定的点与其在压缩数据流中的位置之间的对应关系。 显然,您必须选择一种方法,其中源流与其压缩版本之间的转换方案不会随流中的位置而变化(即没有 LZ77 或 LZ78;相反,您可能想要使用 Huffman 或 byte-对编码。)显然,这会产生大量开销,并且您必须决定如何在“书签点”所需的存储空间和解压缩从 a 开始的流所需的处理器时间之间进行权衡。书签点以获取您在该读取中实际查找的数据。

至于随机访问写入......这几乎是不可能的。 如前所述,压缩是指消除数据中的冗余。 如果您尝试用不具有相同冗余的数据来替换可能并且已经被压缩的数据,因为它是冗余的,那么它根本不适合。

但是,根据您要执行的随机访问写入量,您可以通过维护表示压缩后写入文件的所有数据的稀疏矩阵来模拟它。 在所有读取中,您将检查矩阵以查看是否正在读取压缩后写入的区域。 如果没有,那么您将转到压缩文件中获取数据。

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.

However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.

As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit.

However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文