Compression algorithm for IEEE-754 data

Posted 2024-08-20 22:52:16

Anyone have a recommendation on a good compression algorithm that works well with double precision floating point values? We have found that the binary representation of floating point values results in very poor compression rates with common compression programs (e.g. Zip, RAR, 7-Zip etc).

The data we need to compress is a one-dimensional array of 8-byte values sorted in monotonically increasing order. The values represent temperatures in Kelvin, with a span typically under 100 degrees. The number of values ranges from a few hundred to at most 64K.

Clarifications

  • All values in the array are distinct, though repetition does exist at the byte level due to the way floating point values are represented.

  • A lossless algorithm is desired since this is scientific data. Conversion to a fixed point representation with sufficient precision (~5 decimals) might be acceptable provided there is a significant improvement in storage efficiency.
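For illustration, here is a minimal sketch (my own, in Python, assuming the ~5-decimal fixed-point representation mentioned above is acceptable) of quantizing the sorted Kelvin values, delta-encoding them, and handing the result to a general-purpose compressor; the function names and the use of zlib are illustrative choices, not part of the original question:

```python
import struct
import zlib

SCALE = 10**5   # keep 5 decimal places, per the clarification above

def encode(temps_kelvin):
    """temps_kelvin: values sorted in increasing order; returns compressed bytes."""
    q = [round(t * SCALE) for t in temps_kelvin]            # fixed-point integers
    deltas = [q[0]] + [b - a for a, b in zip(q, q[1:])]     # small, non-negative gaps
    raw = b"".join(struct.pack("<q", d) for d in deltas)
    return zlib.compress(raw, 9)

def decode(blob):
    raw = zlib.decompress(blob)
    deltas = [struct.unpack_from("<q", raw, i)[0] for i in range(0, len(raw), 8)]
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc / SCALE)
    return out

if __name__ == "__main__":
    import random
    temps = sorted(270.0 + 100.0 * random.random() for _ in range(10_000))
    blob = encode(temps)
    assert max(abs(a - b) for a, b in zip(temps, decode(blob))) < 1e-5
    print(len(blob), "compressed bytes vs", 8 * len(temps), "raw")
```

Because the array is sorted and spans under 100 degrees, the deltas are small non-negative integers, which is exactly the kind of regularity byte-oriented compressors handle well.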

Update

Found an interesting article on this subject. Not sure how applicable the approach is to my requirements.

https://userweb.cs.txstate.edu/~burtscher/papers/dcc06.pdf

Comments (8)

左秋 2024-08-27 22:52:16

First thing to consider: try compressing the data before you convert it to double precision. Re your comment to David Thornley, unless your IR imaging ADCs have 24 significant bits, 32-bit floats should have more than enough precision; it is only your requirement to exactly preserve the noise generated by subsequent processing that is a problem. Failing that, it might conceivably be practical to reverse-engineer your processing, by determining a table of values it generates, and storing an index to this table instead.

Second: if your compression algorithm knows that your data is in 8-byte chunks, it will be much more effective; this is because it will not throw your most significant bytes in with the least significant bytes. As a crude preprocessing method, you could try prefixing each double with a distinctive byte (ASCII comma, perhaps?) before piping it through a byte-based compressor like gzip; this should result in better total compression even though the intermediate stream is 12% larger. Less crude but more effort would be to write your own compression adapted to this task -- perhaps using an 8-level tree to represent the expected values of each byte in your double.
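A rough sketch (my own, in Python, not part of the answer) of the delimiter-byte preprocessing described above, with zlib standing in for gzip; the marker byte and function names are illustrative:

```python
import struct
import zlib

MARKER = b","   # the distinctive byte suggested above (ASCII comma)

def preprocess(values):
    """Prefix every 8-byte double with a marker so record boundaries are visible."""
    return b"".join(MARKER + struct.pack("<d", v) for v in values)

def restore(stream):
    out = []
    for i in range(0, len(stream), 9):
        assert stream[i:i + 1] == MARKER
        out.append(struct.unpack_from("<d", stream, i + 1)[0])
    return out

values = [300.0 + i * 0.001 for i in range(1000)]
plain = b"".join(struct.pack("<d", v) for v in values)
marked = preprocess(values)
print(len(zlib.compress(plain, 9)), len(zlib.compress(marked, 9)))
assert restore(marked) == values
```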

Third: as image data is highly redundant, some form of delta coding or other image-related compression should save some space. However, it will not gain you a terribly large amount if you demand lossless compression, as the image noise is inherently incompressible. Also, it will not help you deal with the pseudo-random hash in the less-significant bits of your doubles, as explained above.

蓬勃野心 2024-08-27 22:52:16

All the encoders you list are byte oriented, and are thrown off by a few properties of doubles. For one, there is the layout, where the 12-bit exponent/sign doesn't play well with byte boundaries; for another, there is the noisiness of your input. The first part is easy to deal with in a multitude of ways; the second will limit the effectiveness of any lossless compression that you throw at it. I think that even the best result will be less than amazing; I don't know your data, but I'd suspect you can count on a mere 25% saving, more or less.

Off the top of my head, and perhaps useless because you have already thought of everything on this list...

  1. Treat the stream as 64-bit integers and delta-encode adjacent values. If you have runs of values with the same exponent, it will effectively zero it out, as well as possibly some high mantissa bits. There will be overflows, but the data still needs only 64 bits and the operation can be reversed (a sketch of this step follows below).

  2. At this stage you can optionally try some crude integer prediction, and save differences.

  3. If you have followed the suggestion before, you will have almost half the values starting with 000... and almost half with FFF... To eliminate that, rotate the value to the left (ROL) by 1 bit and XOR it with all Fs if the current LSB is 1. The reverse is to XOR with all Fs if the LSB is 0, then ROR.

On second thought, simply XORing predictions with the true values may be better than taking differences, because then you don't have to do step 3.
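A sketch (my own, in Python, not the answerer's code) of steps 1 and 3: reinterpret each double as a 64-bit integer, delta-encode with wraparound, then fold the sign so that small negative deltas also start with zero bytes. Standard zigzag folding is used here in place of the ROL/XOR trick; it serves the same purpose:

```python
import struct

MASK = (1 << 64) - 1

def to_u64(x):
    """Reinterpret the bits of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def zigzag(delta):
    """Fold sign: map small +/- deltas to small unsigned codes (stands in for step 3)."""
    s = delta if delta < (1 << 63) else delta - (1 << 64)   # reinterpret as signed
    return ((s << 1) ^ (s >> 63)) & MASK

def encode(values):
    prev, out = 0, []
    for v in values:
        u = to_u64(v)
        out.append(zigzag((u - prev) & MASK))   # wraparound keeps it in 64 bits, reversible
        prev = u
    return out

codes = encode([300.0, 300.001, 300.002, 300.1])
print([hex(c) for c in codes])   # after the first value, codes have many leading zero bits
```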

  4. You can try reordering bytes to group bytes with the same significance together, e.g. first all the most significant bytes, and so on. At the very least you should get something like a massive run of zeroes first, with at most a few bits of noise (see the sketch after this list).

  5. Run the result through a generic compressor, or even RLE on the run of zeroes first, then an entropy encoder such as Huffman or, better, the range encoder from 7-Zip/LZMA.
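A sketch (my own, in Python) of steps 4 and 5: a byte-plane transpose that groups bytes of equal significance, followed by zlib standing in for the RLE/entropy-coding stage; the toy input and function names are illustrative:

```python
import random
import struct
import zlib

def byte_transpose(codes):
    """codes: 64-bit unsigned ints (e.g. the folded deltas from the sketch above)."""
    rows = [struct.pack("<Q", c) for c in codes]
    planes = bytearray()
    for i in range(7, -1, -1):                 # most significant byte plane first
        planes.extend(row[i] for row in rows)
    return bytes(planes)

codes = [random.randrange(0, 1 << 20) for _ in range(5000)]   # toy data: small deltas
interleaved = b"".join(struct.pack("<Q", c) for c in codes)
transposed = byte_transpose(codes)                            # long runs of zero bytes first
print(len(zlib.compress(interleaved, 9)), len(zlib.compress(transposed, 9)))
```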

There is one good thing about your data: it is monotonic. There is a bad thing about your data: it is simply too small a set. How much do you want to save, mere kilobytes? What for? The compression effectiveness will suffer a lot if the exponent often differs between adjacent values.

If you are processing a large number of these data sets, you should consider exploiting their similarity to compress them together better - perhaps interleaving them at some stage. If you can live with some loss, zeroing out some least significant bytes might be a good idea - perhaps both on the source data and on the prediction, so that you don't reintroduce noise there.

冰魂雪魄 2024-08-27 22:52:16

If you want high-compression archival storage, "High Throughput Compression of Double-Precision Floating-Point Data" by Burtscher & Ratanaworabhan or "Fast and Efficient Compression of Floating-Point Data" by Lindstrom & Isenburg may be helpful to you.

If you want faster dynamic access at the expense of a lower compression rate then a 1D lifting wavelet may be appropriate. You can quantize the data to smaller integers by specifying the number of digits to keep. Then use delta encoding with a predictive model followed by transformation with Haar or a more expensive wavelet transform and arithmetic encoding of the coefficients greater than a specified value.
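A minimal sketch (my own, in Python, not the answerer's code) of the lifting idea: quantize to integers, then apply one level of the reversible integer Haar (S) transform; a real codec would recurse on the approximation band and entropy-code the detail coefficients:

```python
def haar_forward(x):
    """One level of the reversible integer Haar (S) transform; x: ints, even length."""
    approx, detail = [], []
    for a, b in zip(x[0::2], x[1::2]):
        d = b - a                  # high-pass: difference between neighbours
        s = a + (d >> 1)           # low-pass: integer "average", exactly invertible
        approx.append(s)
        detail.append(d)
    return approx, detail

def haar_inverse(approx, detail):
    x = []
    for s, d in zip(approx, detail):
        a = s - (d >> 1)
        x.extend((a, a + d))
    return x

# Quantize to 5 decimals first (the "number of digits to keep"), then transform.
quantized = [round(t * 10**5) for t in (300.0, 300.00312, 300.0071, 300.0122)]
approx, detail = haar_forward(quantized)
assert haar_inverse(approx, detail) == quantized
print(detail)   # small integers, cheap to entropy-code
```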

Hope it helps.

You can get Lindstrom's ZFP algorithm here: https://github.com/LLNL/zfp

国产ˉ祖宗 2024-08-27 22:52:16

Compression algorithms live on repetitions and regularities, and floating-point numbers don't do well at that.

The first question is whether you can use single-precision floating-point values, which will give you 50% compression immediately. Few thermometers are accurate to seven digits, and the exponent will represent temperatures considerably below anything I've been told you can actually get.
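Whether single precision actually retains the ~5 decimals asked for in the question is easy to check empirically; a quick sketch (my own, not part of the answer) measuring the worst float32 round-trip error over a plausible Kelvin range:

```python
import struct

def roundtrip_f32(x):
    """Store x as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

temps = [250.0 + i * 0.00085 for i in range(100_000)]   # roughly 250-335 K
worst = max(abs(t - roundtrip_f32(t)) for t in temps)
print(f"max float32 round-trip error: {worst:.1e} K")   # on the order of 1e-5 K
```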

If not, can you filter your temperatures, rounding them off to the equivalent of N digits (more likely N/.301 bits)? That may introduce enough regularities to be useful.

If you really have to store 64 bits of information for each temperature reading, and all the bits are significant and not predictable from the other bits, then you effectively can't compress it.

夜唯美灬不弃 2024-08-27 22:52:16

Recently, I did an investigation into lossless compression where I tried breaking the floating point values out into the constituent parts -- sign bits, exponent, and mantissa (further subdivided into 3 parts), then compressed each of these separately using conventional means including a differencing predictor and either Deflate or Huffman coding. The reduction in size was modest, only about 50 percent, but at least it was non-lossy. It does show that even in floating point data, there are places where you can find redundancy.

I've posted a write-up at Lossless Compression for Floating Point Data. There's Java source code associated with the write-up that may be useful if you're still interested in the problem.
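A rough sketch (in Python, not the author's Java code) of the field-splitting idea described above, separating each double into sign, exponent, and mantissa streams so they can be compressed independently:

```python
import struct
import zlib

def split_fields(values):
    """Split each double into sign, exponent, and mantissa byte streams."""
    signs, exps, mantissas = bytearray(), bytearray(), bytearray()
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        signs.append(bits >> 63)                                   # 1 sign bit per value
        exps.extend(struct.pack("<H", (bits >> 52) & 0x7FF))       # 11-bit exponent
        mantissas.extend(struct.pack("<Q", bits & ((1 << 52) - 1))[:7])  # 52-bit mantissa
    return bytes(signs), bytes(exps), bytes(mantissas)

values = [300.0 + i * 0.001 for i in range(1000)]
streams = split_fields(values)
print([len(zlib.compress(s, 9)) for s in streams])   # sign/exponent streams shrink to almost nothing
```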

油焖大侠 2024-08-27 22:52:16

You could think about re-coding your data with an entropy coder (Huffman, Shannon-Fano, Arithmetic Coding). But this will only provide good results if you have many repetitions of the datapoints and if you know which symbols will appear with which probability.

遗忘曾经 2024-08-27 22:52:16

Can you do a delta between adjacent values?
Is there a limit to how much a value can change between measurements? Is it acceptable to limit this change to some maximum rate value (at the cost of introducing some smoothing?)

There is obviously a limit to the real accuracy of the values from the thermal sensor. Do you need to store 64 bits of precision, or would you be better off storing an integer number of, say, 0.01-Kelvin units?

If you can live with some more error and the increase is relatively smooth, you might be better off just fitting a function to the data and storing only a few terms of the function.

EDIT:
Take a look at a typical data set and look at the range of differences between adjacent values. Then look at what accuracy you need to represent this.

E.g.:
If you have at most a 1-degree difference between readings, you could store changes in units of 1/256 of a degree in a byte. If you need to store a bigger range or more accuracy, use a short divided by some factor.
So the next reading would be: next_reading = last_reading + (float)increment / 256.0
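A minimal sketch (my own, in Python; lossy by design) of the byte-per-delta scheme above, tracking the reconstructed value during encoding so errors don't accumulate:

```python
def encode(temps):
    """temps: monotonically increasing readings; lossy, error bounded by ~1/512 degree."""
    first, increments = temps[0], []
    last = temps[0]
    for t in temps[1:]:
        step = max(0, round((t - last) * 256.0))   # clamp guards against rounding overshoot
        assert step <= 255, "delta exceeds the one-degree-per-reading budget"
        increments.append(step)
        last += step / 256.0     # track the reconstructed value, not the source value
    return first, bytes(increments)

def decode(first, increments):
    out = [first]
    for step in increments:
        out.append(out[-1] + step / 256.0)
    return out

temps = [300.0, 300.004, 300.010, 300.013, 300.020]
first, blob = encode(temps)
print([round(x, 5) for x in decode(first, blob)])   # each value within ~1/512 of the input
```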

清君侧 2024-08-27 22:52:16

If decompression and compression speed are important, I would recommend against using prediction-based encodings (e.g. FPC, DFCM [mentioned in other answers]).

In recent years, a flurry of new lossless encodings tailored for floating-point data have appeared which have not been mentioned in this thread and have already been demonstrated to be miles better than DFCM and FPC.

XOR-based approaches: Compress by doing a bitwise XOR with a previous value in the sequence (no predictors). The two best of this kind are currently:

Gorilla (https://github.com/ghilesmeddour/gorilla-time-series-compression) is the most famous of the XORing approaches; however, it has already been demonstrated to be inferior to Chimp128.
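A toy illustration (my own, in Python; not the actual Gorilla or Chimp128 bit format) of the XOR-with-previous idea: consecutive, similar doubles share their sign, exponent, and high mantissa bits, so the XOR of their bit patterns has long runs of leading zeros that these codecs encode compactly:

```python
import struct

def bits(x):
    """Bit pattern of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

temps = [300.12345, 300.12348, 300.12352, 300.12360]
prev = bits(temps[0])
for t in temps[1:]:
    cur = bits(t)
    x = prev ^ cur
    print(f"{x:064b}  leading zeros: {64 - x.bit_length()}")
    prev = cur
```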

ALP: https://github.com/cwida/ALP. Compresses by losslessly transforming doubles into an integer representation (visually, you could see it as moving the decimal point until there are no more decimals) and then encoding these integers using frame-of-reference. This only works up to a certain precision. If the floats are high-precision (e.g. ML parameters), the scheme changes to compressing the front bits with dictionary encoding. ALP is insanely fast to decode (even faster than reading uncompressed data) while achieving higher compression ratios than all the other algorithms.

ALP is available (and is the default encoding) to compress double and float columns in DuckDB for quick testing.

Zstd works pretty well on floating-point data; however, compression and decompression are (relatively) slow. Also, decompression must be done in relatively large blocks of values.

Disclaimer: I am one of the creators of ALP. We did a study comparing a lot of modern floating-point encodings in terms of compression ratios and encoding/decoding speed on a variety of datasets. Results can be seen here: https://dl.acm.org/doi/pdf/10.1145/3626717. See Page 16, Table 4 for the compression ratio comparison, and Page 2, Figure 1 and Page 17, Table 5 for the [de]compression speed comparison.
