Compression algorithm for IEEE-754 data

Posted 2024-08-20 22:52:16

Anyone have a recommendation on a good compression algorithm that works well with double precision floating point values? We have found that the binary representation of floating point values results in very poor compression rates with common compression programs (e.g. Zip, RAR, 7-Zip etc).

The data we need to compress is a one-dimensional array of 8-byte values sorted in monotonically increasing order. The values represent temperatures in Kelvin, with a span typically under 100 degrees. The number of values ranges from a few hundred to at most 64K.

Clarifications

  • All values in the array are distinct, though repetition does exist at the byte level due to the way floating point values are represented.

  • A lossless algorithm is desired since this is scientific data. Conversion to a fixed point representation with sufficient precision (~5 decimals) might be acceptable provided there is a significant improvement in storage efficiency.
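For illustration, here is a minimal sketch (my own, in Python, assuming the ~5-decimal fixed-point representation mentioned above is acceptable) of quantizing the sorted Kelvin values, delta-encoding them, and handing the result to a general-purpose compressor; the function names and the use of zlib are illustrative choices, not part of the original question:

```python
import struct
import zlib

SCALE = 10**5   # keep 5 decimal places, per the clarification above

def encode(temps_kelvin):
    """temps_kelvin: values sorted in increasing order; returns compressed bytes."""
    q = [round(t * SCALE) for t in temps_kelvin]            # fixed-point integers
    deltas = [q[0]] + [b - a for a, b in zip(q, q[1:])]     # small, non-negative gaps
    raw = b"".join(struct.pack("<q", d) for d in deltas)
    return zlib.compress(raw, 9)

def decode(blob):
    raw = zlib.decompress(blob)
    deltas = [struct.unpack_from("<q", raw, i)[0] for i in range(0, len(raw), 8)]
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc / SCALE)
    return out

if __name__ == "__main__":
    import random
    temps = sorted(270.0 + 100.0 * random.random() for _ in range(10_000))
    blob = encode(temps)
    assert max(abs(a - b) for a, b in zip(temps, decode(blob))) < 1e-5
    print(len(blob), "compressed bytes vs", 8 * len(temps), "raw")
```

Because the array is sorted and spans under 100 degrees, the deltas are small non-negative integers, which is exactly the kind of regularity byte-oriented compressors handle well.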

Update

Found an interesting article on this subject. Not sure how applicable the approach is to my requirements.

https://userweb.cs.txstate.edu/~burtscher/papers/dcc06.pdf

Comments (8)

左秋 2024-08-27 22:52:16

First thing to consider: try compressing the data before you convert it to double precision. Re your comment to David Thornley, unless your IR imaging ADCs have 24 significant bits, 32-bit floats should have more than enough precision; it is only your requirement to exactly preserve the noise generated by subsequent processing that is a problem. Failing that, it might conceivably be practical to reverse-engineer your processing, by determining a table of values it generates, and storing an index to this table instead.

Second: if your compression algorithm knows that your data is in 8-byte chunks, it will be much more effective; this is because it will not throw your most significant bytes in with the least significant bytes. As a crude preprocessing method, you could try prefixing each double with a distinctive byte (ASCII comma, perhaps?) before piping it through a byte-based compressor like gzip; this should result in better total compression even though the intermediate stream is 12% larger. Less crude but more effort would be to write your own compression adapted to this task -- perhaps using an 8-level tree to represent the expected values of each byte in your double.
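A rough sketch (my own, in Python, not part of the answer) of the delimiter-byte preprocessing described above, with zlib standing in for gzip; the marker byte and function names are illustrative:

```python
import struct
import zlib

MARKER = b","   # the distinctive byte suggested above (ASCII comma)

def preprocess(values):
    """Prefix every 8-byte double with a marker so record boundaries are visible."""
    return b"".join(MARKER + struct.pack("<d", v) for v in values)

def restore(stream):
    out = []
    for i in range(0, len(stream), 9):
        assert stream[i:i + 1] == MARKER
        out.append(struct.unpack_from("<d", stream, i + 1)[0])
    return out

values = [300.0 + i * 0.001 for i in range(1000)]
plain = b"".join(struct.pack("<d", v) for v in values)
marked = preprocess(values)
print(len(zlib.compress(plain, 9)), len(zlib.compress(marked, 9)))
assert restore(marked) == values
```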

Third: as image data is highly redundant, some form of delta coding or other image-related compression should save some space. However, it will not gain you a terribly large amount if you demand lossless compression, as the image noise is inherently incompressible. Also, it will not help you deal with the pseudo-random hash in the less-significant bits of your doubles, as explained above.

蓬勃野心 2024-08-27 22:52:16

All the encoders you list are byte oriented, and are thrown off by a few properties of doubles. For one, there is the layout, where the 12-bit exponent/sign doesn't play well with byte boundaries; for another, there is the noisiness of your input. The first part is easy to deal with in a multitude of ways; the second will limit the effectiveness of any lossless compression that you throw at it. I think that even the best result will be less than amazing; I don't know your data, but I'd suspect you can count on a mere 25% saving, more or less.

Off the top of my head, and perhaps useless because you have already thought of everything on this list...

  1. Treat the stream as 64-bit integers and delta-encode adjacent values. If you have runs of values with the same exponent, it will effectively zero it out, as well as possibly some high mantissa bits. There will be overflows, but the data still needs only 64 bits and the operation can be reversed (a sketch of this step follows below).

  2. At this stage you can optionally try some crude integer prediction, and save differences.

  3. If you have followed the suggestion before, you will have almost half the values starting with 000... and almost half with FFF... To eliminate that, rotate the value to the left (ROL) by 1 bit and XOR it with all Fs if the current LSB is 1. The reverse is to XOR with all Fs if the LSB is 0, then ROR.

On second thought, simply XORing predictions with the true values may be better than taking differences, because then you don't have to do step 3.
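A sketch (my own, in Python, not the answerer's code) of steps 1 and 3: reinterpret each double as a 64-bit integer, delta-encode with wraparound, then fold the sign so that small negative deltas also start with zero bytes. Standard zigzag folding is used here in place of the ROL/XOR trick; it serves the same purpose:

```python
import struct

MASK = (1 << 64) - 1

def to_u64(x):
    """Reinterpret the bits of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def zigzag(delta):
    """Fold sign: map small +/- deltas to small unsigned codes (stands in for step 3)."""
    s = delta if delta < (1 << 63) else delta - (1 << 64)   # reinterpret as signed
    return ((s << 1) ^ (s >> 63)) & MASK

def encode(values):
    prev, out = 0, []
    for v in values:
        u = to_u64(v)
        out.append(zigzag((u - prev) & MASK))   # wraparound keeps it in 64 bits, reversible
        prev = u
    return out

codes = encode([300.0, 300.001, 300.002, 300.1])
print([hex(c) for c in codes])   # after the first value, codes have many leading zero bits
```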

  4. You can try reordering bytes to group bytes with the same significance together, e.g. first all the most significant bytes, and so on. At the very least you should get something like a massive run of zeroes first, with at most a few bits of noise (see the sketch after this list).

  5. Run the result through a generic compressor, or even RLE on the run of zeroes first, then an entropy encoder such as Huffman or, better, the range encoder from 7-Zip/LZMA.
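A sketch (my own, in Python) of steps 4 and 5: a byte-plane transpose that groups bytes of equal significance, followed by zlib standing in for the RLE/entropy-coding stage; the toy input and function names are illustrative:

```python
import random
import struct
import zlib

def byte_transpose(codes):
    """codes: 64-bit unsigned ints (e.g. the folded deltas from the sketch above)."""
    rows = [struct.pack("<Q", c) for c in codes]
    planes = bytearray()
    for i in range(7, -1, -1):                 # most significant byte plane first
        planes.extend(row[i] for row in rows)
    return bytes(planes)

codes = [random.randrange(0, 1 << 20) for _ in range(5000)]   # toy data: small deltas
interleaved = b"".join(struct.pack("<Q", c) for c in codes)
transposed = byte_transpose(codes)                            # long runs of zero bytes first
print(len(zlib.compress(interleaved, 9)), len(zlib.compress(transposed, 9)))
```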

There is one good thing about your data: it is monotonic. There is a bad thing about your data: it is simply too small a set. How much do you want to save, mere kilobytes? What for? The compression effectiveness will suffer a lot if the exponent often differs between adjacent values.

If you are processing a large number of these data sets, you should consider exploiting their similarity to compress them together better - perhaps interleaving them at some stage. If you can live with some loss, zeroing out some least significant bytes might be a good idea - perhaps both on the source data and on the prediction, so that you don't reintroduce noise there.

冰魂雪魄 2024-08-27 22:52:16

If you want high-compression archival storage, "High Throughput Compression of Double-Precision Floating-Point Data" by Burtscher & Ratanaworabhan or "Fast and Efficient Compression of Floating-Point Data" by Lindstrom & Isenburg may be helpful to you.

If you want faster dynamic access at the expense of a lower compression rate then a 1D lifting wavelet may be appropriate. You can quantize the data to smaller integers by specifying the number of digits to keep. Then use delta encoding with a predictive model followed by transformation with Haar or a more expensive wavelet transform and arithmetic encoding of the coefficients greater than a specified value.
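A minimal sketch (my own, in Python, not the answerer's code) of the lifting idea: quantize to integers, then apply one level of the reversible integer Haar (S) transform; a real codec would recurse on the approximation band and entropy-code the detail coefficients:

```python
def haar_forward(x):
    """One level of the reversible integer Haar (S) transform; x: ints, even length."""
    approx, detail = [], []
    for a, b in zip(x[0::2], x[1::2]):
        d = b - a                  # high-pass: difference between neighbours
        s = a + (d >> 1)           # low-pass: integer "average", exactly invertible
        approx.append(s)
        detail.append(d)
    return approx, detail

def haar_inverse(approx, detail):
    x = []
    for s, d in zip(approx, detail):
        a = s - (d >> 1)
        x.extend((a, a + d))
    return x

# Quantize to 5 decimals first (the "number of digits to keep"), then transform.
quantized = [round(t * 10**5) for t in (300.0, 300.00312, 300.0071, 300.0122)]
approx, detail = haar_forward(quantized)
assert haar_inverse(approx, detail) == quantized
print(detail)   # small integers, cheap to entropy-code
```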

Hope it helps.

You can get Lindstrom's ZFP algorithm here: https://github.com/LLNL/zfp

国产ˉ祖宗 2024-08-27 22:52:16

Compression algorithms live on repetitions and regularities, and floating-point numbers don't do well at that.

The first question is whether you can use single-precision floating-point values, which will give you 50% compression immediately. Few thermometers are accurate to seven digits, and the exponent will represent temperatures considerably below anything I've been told you can actually get.
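Whether single precision actually retains the ~5 decimals asked for in the question is easy to check empirically; a quick sketch (my own, not part of the answer) measuring the worst float32 round-trip error over a plausible Kelvin range:

```python
import struct

def roundtrip_f32(x):
    """Store x as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

temps = [250.0 + i * 0.00085 for i in range(100_000)]   # roughly 250-335 K
worst = max(abs(t - roundtrip_f32(t)) for t in temps)
print(f"max float32 round-trip error: {worst:.1e} K")   # on the order of 1e-5 K
```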

If not, can you filter your temperatures, rounding them off to the equivalent of N digits (more likely N/.301 bits)? That may introduce enough regularities to be useful.

If you really have to store 64 bits of information for each temperature reading, and all the bits are significant and not predictable from the other bits, then you effectively can't compress it.

夜唯美灬不弃 2024-08-27 22:52:16

Recently, I did an investigation into lossless compression where I tried breaking the floating point values out into the constituent parts -- sign bits, exponent, and mantissa (further subdivided into 3 parts), then compressed each of these separately using conventional means including a differencing predictor and either Deflate or Huffman coding. The reduction in size was modest, only about 50 percent, but at least it was non-lossy. It does show that even in floating point data, there are places where you can find redundancy.

I've posted a write-up at Lossless Compression for Floating Point Data. There's Java source code associated with the write-up that may be useful if you're still interested in the problem.
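A rough sketch (in Python, not the author's Java code) of the field-splitting idea described above, separating each double into sign, exponent, and mantissa streams so they can be compressed independently:

```python
import struct
import zlib

def split_fields(values):
    """Split each double into sign, exponent, and mantissa byte streams."""
    signs, exps, mantissas = bytearray(), bytearray(), bytearray()
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        signs.append(bits >> 63)                                   # 1 sign bit per value
        exps.extend(struct.pack("<H", (bits >> 52) & 0x7FF))       # 11-bit exponent
        mantissas.extend(struct.pack("<Q", bits & ((1 << 52) - 1))[:7])  # 52-bit mantissa
    return bytes(signs), bytes(exps), bytes(mantissas)

values = [300.0 + i * 0.001 for i in range(1000)]
streams = split_fields(values)
print([len(zlib.compress(s, 9)) for s in streams])   # sign/exponent streams shrink to almost nothing
```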

油焖大侠 2024-08-27 22:52:16

You could think about re-coding your data with an entropy coder (Huffman, Shannon-Fano, Arithmetic Coding). But this will only provide good results if you have many repetitions of the datapoints and if you know which symbols will appear with which probability.

遗忘曾经 2024-08-27 22:52:16

Can you do a delta between adjacent values?
Is there a limit to how much a value can change between measurements? Is it acceptable to limit this change to some maximum rate value (at the cost of introducing some smoothing?)

There is obviously a limit to the real accuracy of the values from the thermal sensor. Do you need to store 64 bits of precision, or would you be better off storing an integer number of, say, 0.01-Kelvin units?

If you can live with some more error and the increase is relatively smooth, you might be better off just fitting a function to the data and storing only a few terms of the function.

EDIT:
Take a look at a typical data set and look at the range of differences between adjacent values. Then look at what accuracy you need to represent this.

E.g.:
If you have at most a 1-degree difference between readings, you could store changes in units of 1/256 of a degree in a byte. If you need to store a bigger range or more accuracy, use a short divided by some factor.
So the next reading would be: next_reading = last_reading + (float)increment / 256.0
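A minimal sketch (my own, in Python; lossy by design) of the byte-per-delta scheme above, tracking the reconstructed value during encoding so errors don't accumulate:

```python
def encode(temps):
    """temps: monotonically increasing readings; lossy, error bounded by ~1/512 degree."""
    first, increments = temps[0], []
    last = temps[0]
    for t in temps[1:]:
        step = max(0, round((t - last) * 256.0))   # clamp guards against rounding overshoot
        assert step <= 255, "delta exceeds the one-degree-per-reading budget"
        increments.append(step)
        last += step / 256.0     # track the reconstructed value, not the source value
    return first, bytes(increments)

def decode(first, increments):
    out = [first]
    for step in increments:
        out.append(out[-1] + step / 256.0)
    return out

temps = [300.0, 300.004, 300.010, 300.013, 300.020]
first, blob = encode(temps)
print([round(x, 5) for x in decode(first, blob)])   # each value within ~1/512 of the input
```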

清君侧 2024-08-27 22:52:16

If decompression and compression speed are important, I would recommend against using prediction-based encodings (e.g. FPC, DFCM [mentioned in other answers]).

In recent years, a flurry of new lossless encodings tailored for floating-point data have appeared which have not been mentioned in this thread and have already been demonstrated to be miles better than DFCM and FPC.

XOR-based approaches: Compress by doing a bitwise XOR with a previous value in the sequence (no predictors). The two best of this kind are currently:

Gorilla (https://github.com/ghilesmeddour/gorilla-time-series-compression) is the most famous of the XORing approaches; however, it has already been demonstrated to be inferior to Chimp128.
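A toy illustration (my own, in Python; not the actual Gorilla or Chimp128 bit format) of the XOR-with-previous idea: consecutive, similar doubles share their sign, exponent, and high mantissa bits, so the XOR of their bit patterns has long runs of leading zeros that these codecs encode compactly:

```python
import struct

def bits(x):
    """Bit pattern of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

temps = [300.12345, 300.12348, 300.12352, 300.12360]
prev = bits(temps[0])
for t in temps[1:]:
    cur = bits(t)
    x = prev ^ cur
    print(f"{x:064b}  leading zeros: {64 - x.bit_length()}")
    prev = cur
```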

ALP: https://github.com/cwida/ALP. Compresses by losslessly transforming doubles into an integer representation (visually, you could see it as moving the decimal point until there are no more decimals) and then encoding these integers using frame-of-reference. This only works up to a certain precision. If the floats are high-precision (e.g. ML parameters), the scheme changes to compressing the front bits with dictionary encoding. ALP is insanely fast to decode (even faster than reading uncompressed data) while achieving higher compression ratios than all the other algorithms.

ALP is available (and is the default encoding) to compress double and float columns in DuckDB for quick testing.

Zstd works pretty well on floating-point data; however, compression and decompression are (relatively) slow. Also, decompression must be done in relatively large blocks of values.

Disclaimer: I am one of the creators of ALP. We did a study comparing a lot of modern floating-point encodings in terms of compression ratios and encoding/decoding speed on a variety of datasets. Results can be seen here: https://dl.acm.org/doi/pdf/10.1145/3626717. See Page 16, Table 4 for the compression ratio comparison, and Page 2, Figure 1 and Page 17, Table 5 for the [de]compression speed comparison.
