Compressing floating point data
Are there any lossless compression methods that can be applied to floating point time-series data, and will significantly outperform, say, writing the data as binary into a file and running it through gzip?
Reduction of precision might be acceptable, but it must happen in a controlled way (i.e. I must be able to set a bound on how many digits must be kept)
I am working with some large data files which are series of correlated doubles, describing a function of time (i.e. the values are correlated). I don't generally need the full double precision, but I might need more than float.
Since there are specialized lossless methods for images/audio, I was wondering if anything specialized exists for this situation.
Clarification: I am looking for existing practical tools rather than a paper describing how to implement something like this. Something comparable to gzip in speed would be excellent.
Here are some ideas if you want to create your own simple algorithm:
You might want to have a look at these resources: the algorithm and its implementation described in "Fast Lossless Compression of Scientific Floating-Point Data" and "High Throughput Compression of Double-Precision Floating-Point Data".
You might also want to try Logluv-compressed TIFF for this, though I haven't used it myself.
One technique that the HDF5 people use is "shuffling", where you group each byte for N floating point values together. This is more likely to give you repetitive sequences of bytes which will compress better with gzip, for example.
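For illustration, a minimal numpy sketch of the byte-shuffle idea (not HDF5's actual implementation; the synthetic series and the compression level are arbitrary):

```python
import numpy as np
import zlib

values = np.cumsum(np.random.randn(100_000))  # smooth, correlated double series

raw = values.astype(np.float64).tobytes()

# Shuffle: view the buffer as an N x 8 byte matrix and transpose it, so that
# byte 0 of every value is stored together, then byte 1, and so on.
shuffled = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 8).T.copy().tobytes()

print(len(zlib.compress(raw, 9)), len(zlib.compress(shuffled, 9)))
```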
A second method I have found which greatly reduces the size of compressed gzipped data is to first convert the data to the float16 (half-precision) format and back again to float32. This produces a lot of zeros in the output stream, which can shrink file sizes by around 40-60 per cent after compression. One subtlety is that the maximum float16 value is rather low, so you may want to scale your data first, e.g. in Python (see the sketch below).
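A minimal numpy sketch of that float32 → float16 → float32 round trip (the scale factor here is illustrative; pick one that keeps your scaled data well below the float16 maximum of about 65504):

```python
import numpy as np
import zlib

data = np.linspace(0.0, 1.0e6, 100_000, dtype=np.float32)

# Scale so the values stay inside the float16 range, then round-trip through
# float16 to zero out the low mantissa bits.
scale = np.float32(np.abs(data).max() / 1000.0)
rounded = (data / scale).astype(np.float16).astype(np.float32)

# 'rounded' (plus the single scale factor) is what you compress and store;
# multiply by 'scale' after decompression to recover the approximate values.
print(len(zlib.compress(data.tobytes(), 9)),
      len(zlib.compress(rounded.tobytes(), 9)))
```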
Some tests suggest that the mean absolute fractional difference between input and output for some data is around 0.00019, with a maximum of 0.00048. This is in line with the 11-bit mantissa precision (about one part in 2**11).
Since you state that you need a precision somewhere between 'float' and 'double': you can zero out any number of least-significant bits in single- and double-precision floats. IEEE-754 floating point numbers are represented in binary roughly as seeefffffffff, which represents the value sign * 1.fffffff * 2^(eee).
You can zero out the least significant fraction (f) bits. For single-precision (32-bit) floats, there are 23 fraction bits of which you can zero out up to 22. For double-precision (64-bit), it's 52 and up to 51. (If you zero out all bits, then the special values NaN and +/-inf will be lost).
Especially if the data represents decimal values such as 1.2345, this will help in data compression. That is because 1.2345 cannot be represented exactly as a binary floating point value, but rather as 0x3ff3c083126e978d, which is not friendly to data compression. Chopping off the least significant 24 bits will result in 0x3ff3c08312000000, which is still accurate to about 9 decimal digits (in this example, the difference is 1.6e-9). If you do this on the raw data and then store the differences between subsequent numbers, the result will be even more compression-friendly (via gzip) if the raw data varies slowly.
Here is an example in python/numpy:
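A minimal sketch, assuming 64-bit doubles viewed through numpy (the 24-bit cut, the helper name, and the delta step are illustrative):

```python
import numpy as np
import zlib

def zero_low_bits(values, bits_to_zero):
    """Clear the given number of least-significant mantissa bits of float64 values."""
    mask = np.uint64(0xFFFFFFFFFFFFFFFF) << np.uint64(bits_to_zero)
    return (values.view(np.uint64) & mask).view(np.float64)

t = np.linspace(0.0, 10.0, 100_000)
data = np.sin(t) + 0.1 * t                      # slowly varying series

truncated = zero_low_bits(data, 24)             # roughly 9 decimal digits remain
deltas = np.diff(truncated, prepend=truncated[0])

print(len(zlib.compress(data.tobytes(), 9)),
      len(zlib.compress(truncated.tobytes(), 9)),
      len(zlib.compress(deltas.tobytes(), 9)))
```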
Since you're asking for existing tools, maybe zfp will do the trick.
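For example, a hedged sketch using zfp's Python bindings (zfpy), assuming a recent release that exposes compress_numpy/decompress_numpy; the tolerance is arbitrary and sets the absolute error bound:

```python
import numpy as np
import zfpy  # Python bindings shipped with zfp (pip install zfpy)

data = np.cumsum(np.random.randn(1_000_000))   # correlated double series

# Fixed-accuracy mode: the absolute error is bounded by the given tolerance.
compressed = zfpy.compress_numpy(data, tolerance=1e-6)
restored = zfpy.decompress_numpy(compressed)

print(data.nbytes, len(compressed))
assert np.max(np.abs(restored - data)) <= 1e-6
```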
Possible methods that can be used for floating-point compression:
Transpose 4xN for float and 8xN for double + lz77
Implementation: Floating point compression in TurboTranspose
See also error-bounded lossy compression.
Predictor (e.g. finite context method) + encoding (e.g. "integer compression").
Implementation: Floating point compression in TurboPFor
When possible, convert all floating-point numbers to integers (e.g. 1.63 -> 163), then use integer compression (see the sketch after this list).
Implementation: Integer compression
You can test all these methods on your own data using the icapp tool for Linux and Windows.
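A rough numpy sketch of the integer-conversion idea (the decimal scale factor is assumed known, and zlib stands in for a proper integer codec such as TurboPFor):

```python
import numpy as np
import zlib

# Values with two decimal digits, e.g. 1.63, as in the example above.
data = np.round(np.cumsum(np.random.randn(100_000)), 2)

# Convert to integers (1.63 -> 163), then delta-encode; a real setup would feed
# the deltas to an integer codec (TurboPFor, varints, ...) rather than zlib.
ints = np.round(data * 100).astype(np.int64)
deltas = np.diff(ints, prepend=ints[0]).astype(np.int32)

print(len(zlib.compress(data.tobytes(), 9)),
      len(zlib.compress(deltas.tobytes(), 9)))
```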
You can use Holt's exponential smoothing algorithm (a prediction-based compression technique). Initially assign some weights to the data and predict the next value. If the predicted and actual values are close, XORing their bit patterns produces many zeros in the most significant bits.
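A rough sketch of that idea (Holt's level-plus-trend smoothing as the predictor; the smoothing constants, the helper name, and the leading-zero printout are illustrative):

```python
import struct

def holt_xor_residuals(values, alpha=0.5, beta=0.5):
    """Predict each value with Holt's (level + trend) smoothing and XOR the
    64-bit patterns of prediction and actual; close predictions yield residuals
    with many leading zero bits, which a backend coder can then exploit."""
    level, trend = values[0], 0.0
    residuals = []
    for x in values[1:]:
        prediction = level + trend
        pred_bits = struct.unpack("<Q", struct.pack("<d", prediction))[0]
        actual_bits = struct.unpack("<Q", struct.pack("<d", x))[0]
        residuals.append(pred_bits ^ actual_bits)
        # Holt update: new level and trend from the observed value.
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return residuals

series = [10.0 + 0.1 * i for i in range(20)]
for r in holt_xor_residuals(series)[:5]:
    print(f"{r:064b}")   # note the runs of leading zeros
```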
TL;DR
Try ALP: https://github.com/cwida/ALP (C++). Open-source algorithm for lossless floating-point compression, currently used by default in DuckDB to compress float/double. Insanely fast with the highest compression ratios overall.
Detailed answer:
In recent years, a flurry of new lossless encodings tailored for floating-point data have appeared which have not been mentioned in this thread and have already been demonstrated to be miles better than FPC, zfp, fpzip (mentioned in other answers) and any prediction-based or other approach.
I will briefly summarise these and link implementations ready for use.
XOR-based approaches: Compress by doing a bitwise XOR with a previous value in the sequence (no predictors). The two best of this kind are currently:
XOR-based approaches have a few shortcomings. For instance, they cannot compress high-precision data. Also, despite being faster than predictive schemes, their per-value [de]compression (branchy code) and data dependencies (which cannot be SIMDized) stop them from being really fast.
ALP: https://github.com/cwida/ALP (C++). Compresses by losslessly transforming doubles into an integer representation (visually, you can see it as moving the decimal point until there are no more decimals) and then encoding these integers using frame-of-reference. This only works up to a certain precision. If the floats are high-precision (e.g. ML parameters), the scheme changes to compressing the front bits with dictionary encoding. ALP is insanely fast to decode (by orders of magnitude; even faster than reading uncompressed data) while achieving higher compression ratios than all the other algorithms.
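A toy numpy sketch of the decimal-scaling plus frame-of-reference idea (purely an illustration of the concept, not ALP's actual encoding; the exponent search, helper name, and fallback are simplified):

```python
import numpy as np

def toy_alp_encode(values, max_exp=10):
    """Try successive powers of ten until every value survives an exact round
    trip as an integer, then store frame-of-reference offsets from the minimum."""
    for e in range(max_exp + 1):
        scaled = np.round(values * 10.0**e)
        if np.all(scaled / 10.0**e == values):        # lossless at this exponent?
            base = np.int64(scaled.min())
            offsets = scaled.astype(np.int64) - base  # small non-negative integers
            return e, base, offsets
    return None                                       # would fall back to another scheme

values = np.array([1.63, 1.71, 1.70, 1.68, 1.75])
e, base, offsets = toy_alp_encode(values)
print(e, base, offsets)                               # 2 163 [ 0  8  7  5 12]

# Decoding is just the reverse:
decoded = (base + offsets) / 10.0**e
assert np.array_equal(decoded, values)
```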
Another alternative: Zstd (https://github.com/facebook/zstd) works pretty well on floating point data. However, compression and decompression are (relatively) slow. Also, decompression must be done in relatively big blocks of values (which is bad if you only want to read a few values of your data).
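For completeness, a small sketch using the zstandard Python package (one of several ways to drive Zstd; the compression level is arbitrary):

```python
import numpy as np
import zstandard as zstd   # pip install zstandard

data = np.cumsum(np.random.randn(1_000_000))        # correlated doubles

compressed = zstd.ZstdCompressor(level=19).compress(data.tobytes())
restored = np.frombuffer(zstd.ZstdDecompressor().decompress(compressed),
                         dtype=np.float64)

print(data.nbytes, len(compressed))
assert np.array_equal(restored, data)
```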
Disclaimer: I am one of the creators of ALP. We did a study comparing a lot of modern floating-point encodings in terms of compression ratios and encoding/decoding speed on a variety of datasets. The results can be seen here: https://dl.acm.org/doi/pdf/10.1145/3626717. See page 16, Table 4 for the compression ratio comparison, and page 2, Figure 1 and page 17, Table 5 for the [de]compression speed comparison.
I just found this link that focuses on compressing FLOAT32:
https://computing.llnl.gov/projects/floating-point-compression
My old answer below still applies to any kind of data.
tar + zstd seems to be the fastest and the "compressiest" compression combination out there. You can use it with a single command:
For unzipping the files, use the command:
And to list the contents of the file without decompressing use the command:
The zstd algorithm only works on UNIX operating systems as far as I know. You can find more info about it here:
https://github.com/centminmod/tar-zstd/blob/master/readme.md