Compressing floating point data

Are there any lossless compression methods that can be applied to floating point time-series data, and will significantly outperform, say, writing the data as binary into a file and running it through gzip?

Reduction of precision might be acceptable, but it must happen in a controlled way (i.e. I must be able to set a bound on how many digits must be kept).

I am working with some large data files which are series of correlated doubles, describing a function of time (i.e. the values are correlated). I don't generally need the full double precision but I might need more than float.

Since there are specialized lossless methods for images/audio, I was wondering if anything specialized exists for this situation.

Clarification: I am looking for existing practical tools rather than a paper describing how to implement something like this. Something comparable to gzip in speed would be excellent.

Comments (9)

流殇 2024-12-29 17:37:14

Here are some ideas if you want to create your own simple algorithm (a rough sketch in code follows the list):

  • Use xor of the current value with the previous value to obtain a set of bits describing the difference.
  • Divide this difference into two parts: one part is the "mantissa bits" and one part is the "exponent bits".
  • Use variable length encoding (different number of bits/bytes per value), or any compression method you choose, to save these differences. You might use separate streams for mantissas and exponents, since mantissas have more bits to compress.
  • This may not work well if you are alternating between two different time-value streams sources. So you may have to compress each source into a separate stream or block.
  • To lose precision, you can drop the least significant bits or bytes from the mantissa, while leaving the exponent intact.
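
A minimal sketch of the ideas above (my own illustration rather than a finished codec), assuming a 1-D numpy float64 array; the variable-length coding stage is replaced by gzip here, and the function name xor_split_compress is hypothetical:

import gzip
import numpy as np

def xor_split_compress(values):
    # XOR each double with its predecessor to expose the (usually small) difference.
    bits = np.asarray(values, dtype=np.float64).view(np.uint64)
    deltas = bits.copy()
    deltas[1:] ^= bits[:-1]
    # Split each 64-bit delta into sign+exponent bits (top 12) and mantissa bits (low 52).
    exponents = (deltas >> np.uint64(52)).astype(np.uint16)
    mantissas = deltas & np.uint64((1 << 52) - 1)
    # Compress the two streams separately; the mantissa stream carries most of the bits.
    return gzip.compress(exponents.tobytes()), gzip.compress(mantissas.tobytes())
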
云雾 2024-12-29 17:37:14

You may want to have a look at the following resources:

You might also want to try Logluv compressed TIFF for this, though I haven't used them myself.

幻梦 2024-12-29 17:37:14

One technique that the HDF5 people use is "shuffling", where you group each byte for N floating point values together. This is more likely to give you repetitive sequences of bytes which will compress better with gzip, for example.
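
A minimal numpy sketch of this shuffling step (my own illustration, not the HDF5 shuffle filter itself), assuming a 1-D float array; the function names are hypothetical:

import numpy as np

def byte_shuffle(a):
    # Group byte 0 of every value, then byte 1, and so on; the slowly varying
    # high-order bytes end up in long runs that gzip compresses well.
    arr = np.ascontiguousarray(a)
    return arr.view(np.uint8).reshape(-1, arr.itemsize).T.tobytes()

def byte_unshuffle(buf, dtype, count):
    # Reverse the transform by rebuilding the (count, itemsize) byte matrix.
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, count)
    return planes.T.copy().view(dtype).reshape(count)

Comparing gzip.compress(byte_shuffle(a)) against gzip.compress(a.tobytes()) shows whether the transposition helps on your data.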

A second method I have found which greatly reduces the size of compressed gzipped data is to first convert the data to the float16 (half-precision) format and back again to float32. This produces a lot of zeros in the output stream which can shrink file sizes by around 40-60 per cent after compression. One subtlety is that the maximum float16 value is rather low, so you may want to scale your data first, e.g. in python

import numpy as np
import math

input = np.array(...)

# format can only hold 65504 maximum, so we scale input data
log2max = int(math.log(np.nanmax(input), 2))
scale = 2**(log2max - 14)
scaled = input * (1./scale)

# do the conversion to float16
temp_float16 = np.array(scaled, dtype=np.float16)
# convert back again and rescale
output = np.array(temp_float16, dtype=np.float32) * scale

Some tests suggest that the mean absolute fractional difference between input and output for some data is around 0.00019, with a maximum of 0.00048. This is in line with the 11-bit mantissa precision of float16 (a worst-case relative rounding error of about 2**-11 ≈ 0.00049).

九局 2024-12-29 17:37:14

Since you state that you need a precision somewhere between 'float' and 'double': you can zero out any number of least significant bits in single- and double-precision floats. IEEE-754 floating point numbers are represented in binary roughly like seeefffffffff, which represents the value

sign*1.fffffff*2^(eee).

You can zero out the least significant fraction (f) bits. For single-precision (32-bit) floats, there are 23 fraction bits of which you can zero out up to 22. For double-precision (64-bit), it's 52 and up to 51. (If you zero out all bits, then the special values NaN and +/-inf will be lost).

Especially if the data represents decimal values such as 1.2345, this will help in data compression. That is because 1.2345 cannot be represented exactly as a binary floating point value, but rather as 0x3ff3c083126e978d, which is not friendly to data compression. Chopping off the least significant 24 bits will result in 0x3ff3c08312000000, which is still accurate to about 9 decimal digits (in this example, the difference is 1.6e-9).

If you do this on the raw data and then store the differences between successive numbers, the result will be even more compression-friendly (via gzip) if the raw data varies slowly.

Here is an example in C:

#include <stdint.h>
#include <string.h>

/* Zero out the 'zerobits' least significant mantissa bits of x. */
double double_trunc(double x, int zerobits)
{
  /* mask is e.g. 0xffffffffffff0000 for zerobits==16 */
  uint64_t mask = ~((UINT64_C(1) << zerobits) - 1);
  uint64_t floatbits;
  memcpy(&floatbits, &x, sizeof floatbits);  /* well-defined alternative to pointer casts */
  floatbits &= mask;
  memcpy(&x, &floatbits, sizeof x);
  return x;
}

And one in python/numpy:

import numpy as np

def float_trunc(a, zerobits):
    """Set the least significant <zerobits> bits to zero in a numpy float32 or float64 array.
    Do this in-place. Also return the updated array.
    Maximum values of 'zerobits': 51 for float64; 22 for float32.
    """
    at = a.dtype
    assert at == np.float64 or at == np.float32 or at == np.complex128 or at == np.complex64
    if at == np.float64 or at == np.complex128:
        assert zerobits <= 51
        mask = 0xffffffffffffffff - (1 << zerobits) + 1
        bits = a.view(np.uint64)
        bits &= mask
    elif at == np.float32 or at == np.complex64:
        assert zerobits <= 22
        mask = 0xffffffff - (1 << zerobits) + 1
        bits = a.view(np.uint32)
        bits &= mask
    return a
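
As a quick made-up usage example, combining the truncation above with the XOR-of-previous-value differencing suggested earlier keeps the pipeline lossless (up to the chosen truncation) and gives gzip long runs of zero bytes:

import gzip
import numpy as np

rng = np.random.default_rng(0)
a = np.cumsum(rng.normal(size=100_000))      # slowly varying placeholder series
float_trunc(a, 32)                           # keep roughly 6 significant decimal digits

bits = a.view(np.uint64)
deltas = bits.copy()
deltas[1:] ^= bits[:-1]                      # XOR with the previous value
print(len(gzip.compress(deltas.tobytes())), len(gzip.compress(a.tobytes())))
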
╰◇生如夏花灿烂 2024-12-29 17:37:14

Since you're asking for existing tools, maybe zfp will do the trick.

写下不归期 2024-12-29 17:37:14

Possible methods that can be used for floating-point compression:

You can test all these methods, with your data, using the icapp tool for Linux and Windows.

我的鱼塘能养鲲 2024-12-29 17:37:14

You can use Holt's exponential smoothing algorithm (a prediction-based compression algorithm). Initially assign some weight to the data and predict the next value. If the prediction matches the actual value closely, XORing the two produces many zeros in the most significant bits.
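
A rough sketch of that idea (not a complete codec; the smoothing parameters alpha and beta and the function names are my assumptions), using Holt's linear smoothing to predict each sample and XORing the bit patterns of prediction and actual value:

import struct

def f2u(v):
    # Reinterpret a float's IEEE-754 bits as an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", v))[0]

def holt_xor_residuals(values, alpha=0.5, beta=0.3):
    residuals = [f2u(values[0])]                      # first value stored verbatim
    level, trend = values[0], 0.0
    for x in values[1:]:
        prediction = level + trend
        residuals.append(f2u(prediction) ^ f2u(x))    # good predictions -> leading zeros
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return residuals

The residuals still need an entropy coder (e.g. leading-zero counts plus the remaining bits) to actually shrink the data.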

落墨 2024-12-29 17:37:14

TL;DR
Try ALP: https://github.com/cwida/ALP (C++). Open-source algorithm for lossless floating-point compression currently used by default in DuckDB to compress float/double. Insanely fast with the highest compression ratios overall.

Detailed answer:

In recent years, a flurry of new lossless encodings tailored for floating-point data have appeared which have not been mentioned in this thread and have already been demonstrated to be miles better than FPC, zfp, fpzip (mentioned in other answers) and any prediction-based or other approaches.

I will briefly summarise these and link implementations ready for use.

XOR-based approaches: Compress by doing a bitwise XOR with a previous value in the sequence (no predictors). The two best of this kind are currently:

XOR-based approaches have a few shortcomings. For instance, they cannot compress high-precision data. Also, despite being faster than predictive schemes, their per-value [de]compression (branchy code) and data dependencies (which cannot be SIMDized) stop them from being really fast.

ALP: https://github.com/cwida/ALP (C++). Compresses by losslessly transforming doubles into an integer representation (visually, you can think of it as shifting the decimal point until there are no decimals left) and then encoding these integers with frame-of-reference. This only works up to a certain precision. If the floats are high-precision (e.g. ML parameters), the scheme switches to compressing the front bits with dictionary encoding. ALP is insanely fast to decode (by orders of magnitude; even faster than reading uncompressed data) while achieving higher compression ratios than all the other algorithms.
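
A toy sketch of that core transform (my simplification for illustration only, not the real ALP implementation or its API), assuming the values are exactly representable with a fixed number of decimal digits:

import numpy as np

def alp_like_encode(values, decimal_exponent):
    # Shift the decimal point: scale so the doubles become whole numbers.
    x = np.asarray(values, dtype=np.float64)
    ints = np.round(x * 10.0 ** decimal_exponent)
    # The transform must round-trip exactly; real ALP falls back to another
    # scheme (dictionary-coded front bits) when it does not.
    if not np.array_equal(ints / 10.0 ** decimal_exponent, x):
        raise ValueError("not exactly representable at this exponent")
    ints = ints.astype(np.int64)
    # Frame-of-reference: a base value plus small non-negative offsets.
    base = int(ints.min())
    offsets = (ints - base).astype(np.uint64)
    return base, decimal_exponent, offsets   # decode: (base + offset) / 10**decimal_exponent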

Another alternative: Zstd (https://github.com/facebook/zstd) works pretty well on floating point data. However, compression and decompression are (relatively) slow. Also, decompression must be done in relatively big blocks of values (which is bad if you only want to read a few values of your data).
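
For example, with the python-zstandard bindings (the zstandard package; the array contents below are placeholders), a float64 buffer can be compressed like this:

import numpy as np
import zstandard as zstd

data = np.linspace(0.0, 1.0, 1_000_000).tobytes()            # placeholder time series
compressed = zstd.ZstdCompressor(level=19).compress(data)     # level=19 is an arbitrary high setting
restored = np.frombuffer(zstd.ZstdDecompressor().decompress(compressed), dtype=np.float64)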

Disclaimer: I am one of the creators of ALP. We did a study comparing a lot of modern floating-point encodings in terms of compression ratios and encoding/decoding speed on a variety of datasets. Results can be seen here: https://dl.acm.org/doi/pdf/10.1145/3626717. See Page 16, Table 4 for the compression-ratio comparison, and Page 2, Figure 1 and Page 17, Table 5 for the [de]compression speed comparison.

多像笑话 2024-12-29 17:37:14

I just found this link that focuses on compressing FLOAT32:
https://computing.llnl.gov/projects/floating-point-compression

My old answer below remains applicable to data of any type.

TAR ZSTD seems to be the fastest and the "compressiest" zipping algorithm out there. You can use it with a single command:

tar --use-compress-program=zstd -cvf NameOFCompressedFile.tar.zst ./Files2BeCompressed   

For unzipping the files, use the command:

tar --use-compress-program=zstd -xvf NameOFCompressedFile.tar.zst

And to list the contents of the file without decompressing use the command:

tar --use-compress-program=zstd --list --verbose --file=NameOFCompressedFile.tar.zst

The zstd algorithm only works on UNIX operating systems as far as I know. You can find more info about it here:
https://github.com/centminmod/tar-zstd/blob/master/readme.md
