Best compression algorithm? (See below for definition of "best")

Posted 2024-07-08 19:35:54

I want to store a list of the following tuples in a compressed format, and I was wondering which algorithm gives me:

  • smallest compressed size
  • fastest de/compression
  • tradeoff optimum ("knee" of the tradeoff curve)

My data looks like this:

(<int>, <int>, <double>), 
(<int>, <int>, <double>), 
...
(<int>, <int>, <double>)

One of the two ints refers to a point in time, and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract id; those values are less likely to be close, although they aren't going to be completely random, either. The double represents a sensor reading, and while there is some correlation between the values, it's probably not of much use.

Comments (6)

梦里寻她 2024-07-15 19:35:54

Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.

Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].

To find out what compression range your result will be in, you can write your data to a file and run it through Christian Martelock's compressor CCM. I found that it performs very well for this kind of data collection. It uses a fairly fast context-mixing algorithm. You can also compare it against other compressors like WinZip, or use a compression library like zlib, to see if the effort is worth it.
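
A minimal C# sketch of both ideas, assuming the first int is the timestamp (the tuple layout and the names below are made up for illustration):

using System.Collections.Generic;

static class ColumnarDelta
{
    // Split the rows into three column arrays and delta-code the time
    // column: the first time value is stored as-is, every later one as
    // the difference to its predecessor.
    public static (int[] TimeDeltas, int[] Ids, double[] Values)
        Encode(IReadOnlyList<(int Time, int Id, double Value)> rows)
    {
        var timeDeltas = new int[rows.Count];
        var ids = new int[rows.Count];
        var values = new double[rows.Count];
        int previousTime = 0;

        for (int i = 0; i < rows.Count; i++)
        {
            timeDeltas[i] = rows[i].Time - previousTime;
            previousTime = rows[i].Time;
            ids[i] = rows[i].Id;
            values[i] = rows[i].Value;
        }
        return (timeDeltas, ids, values);
    }
}

The three arrays can then be fed to a general-purpose compressor one after the other, so that similar values sit next to each other.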

断舍离 2024-07-15 19:35:54

Here is a common scheme used in most search engines: store deltas of values and encode the delta using a variable byte encoding scheme, i.e. if the delta is less than 128, it can be encoded with only 1 byte. See vint in Lucene and Protocol buffers for details.

This will not give you the best compression ratio, but it is usually the fastest at encoding/decoding throughput.
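
A sketch of that scheme in C#: this is the little-endian base-128 variant used by Protocol Buffers, and it assumes non-negative values (negative deltas would typically be zigzag-encoded first):

using System.IO;

static class VInt
{
    // Write a non-negative int using 7 data bits per byte; the high bit
    // signals that more bytes follow. Values below 128 take one byte.
    public static void Write(Stream s, int value)
    {
        uint v = (uint)value;
        while (v >= 0x80)
        {
            s.WriteByte((byte)(v | 0x80));
            v >>= 7;
        }
        s.WriteByte((byte)v);
    }

    // Read a value written by Write. No end-of-stream handling here;
    // this is a sketch, not a robust decoder.
    public static int Read(Stream s)
    {
        int result = 0, shift = 0, b;
        do
        {
            b = s.ReadByte();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}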

妄司 2024-07-15 19:35:54

If I'm reading the question correctly, you simply want to store the data efficiently. Obvious options like compressed xml are simple, but there are more direct binary serialization methods. One that leaps to mind is Google's protocol buffers.

For example, in C# with protobuf-net, you can simply create a class to hold the data:

[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Value1 {get;set;}
    [ProtoMember(2)]
    public int Value2 {get;set;}
    [ProtoMember(3)]
    public double Value3 {get;set;}
}

Then just [de]serialize a List<Foo> or Foo[] etc. via the ProtoBuf.Serializer class.

I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.

This approach, as well as being simple to implement, also has the advantage of being simple to use from other architectures, since there are protocol buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc), and/or xml-based serialization.
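
For example, a sketch of the round trip with protobuf-net's static Serializer class (the file name is arbitrary):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

class Demo
{
    static void Main()
    {
        var readings = new List<Foo>
        {
            new Foo { Value1 = 1215541200, Value2 = 42, Value3 = 21.5 }
        };

        // Serialize to a file...
        using (var file = File.Create("readings.bin"))
        {
            Serializer.Serialize(file, readings);
        }

        // ...and read it back.
        List<Foo> roundTripped;
        using (var file = File.OpenRead("readings.bin"))
        {
            roundTripped = Serializer.Deserialize<List<Foo>>(file);
        }
    }
}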

春风十里 2024-07-15 19:35:54

Sort as already proposed, then store the

(first ints)
(second ints)
(doubles)

as a transposed matrix, and compress the result.
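
A sketch of what that could look like in C#, with DEFLATE from System.IO.Compression applied on top (the framing, a single shared length prefix, is my own choice):

using System.IO;
using System.IO.Compression;

static class TransposedWriter
{
    // Write the three sorted columns back to back through one DEFLATE
    // stream, so that similar values end up next to each other. All three
    // columns share one length, since each row contributes one entry.
    public static void Write(string path, int[] firstInts, int[] secondInts, double[] doubles)
    {
        using (var file = File.Create(path))
        using (var deflate = new DeflateStream(file, CompressionLevel.Optimal))
        using (var writer = new BinaryWriter(deflate))
        {
            writer.Write(firstInts.Length);
            foreach (var v in firstInts) writer.Write(v);
            foreach (var v in secondInts) writer.Write(v);
            foreach (var v in doubles) writer.Write(v);
        }
    }
}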

月野兔 2024-07-15 19:35:54

Most compression algorithms will work equally badly on such data. However, there are a few things ("preprocessing") that you can do to increase the compressibility of the data before feeding it to a gzip- or deflate-like algorithm. Try the following:

First, if possible, sort the tuples in ascending order. Use the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, similar ids will be placed close together.

Next, if the measurements are taken at regular intervals, replace the timestamp with the difference from the previous timestamp (except for the very first tuple of a sensor, of course). For example, if all measurements are taken at 5-minute intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, because most values are equal.

Then, assuming that the measured values are stable over time, replace all readings with a delta from the previous reading of the same sensor. Again, most values will be close to zero and thus more compressible.

Also, floating-point values are very bad candidates for compression because of their internal representation. Try converting them to integers. For example, temperature readings most likely do not require more than two decimal digits: multiply the values by 100 and round to the nearest integer.
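
Putting the four steps together, a sketch in C# (the field names and the two-decimal scale factor of 100 are assumptions taken from the example above):

using System;
using System.Collections.Generic;
using System.Linq;

static class Preprocessor
{
    // Sort by sensor id, then by timestamp; quantize each reading to two
    // decimals; delta-code timestamps and readings per sensor. The first
    // tuple of each sensor keeps its absolute values (delta to zero).
    public static List<(int Id, int TimeDelta, int ReadingDelta)> Run(
        IEnumerable<(int Id, int Timestamp, double Value)> rows)
    {
        var result = new List<(int Id, int TimeDelta, int ReadingDelta)>();
        foreach (var sensor in rows.GroupBy(r => r.Id).OrderBy(g => g.Key))
        {
            int prevTime = 0, prevReading = 0;
            foreach (var r in sensor.OrderBy(r => r.Timestamp))
            {
                int reading = (int)Math.Round(r.Value * 100);
                result.Add((sensor.Key, r.Timestamp - prevTime, reading - prevReading));
                prevTime = r.Timestamp;
                prevReading = reading;
            }
        }
        return result;
    }
}

The output tuples are dominated by small numbers, which both general-purpose compressors and variable-byte codes handle well.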

夏天碎花小短裙 2024-07-15 19:35:54

Great answers! For the record, I'm going to merge the ones I upvoted into the approach I'm finally using:

  1. Sort and reorganize the data so that similar numbers are next to each other, i.e. sort by id first, then by timestamp, and rearrange from (<int1>, <int2>, <double>), ... to ([<int1>, <int1>, ...], [<int2>, <int2>, ...], [<double>, <double>, ...]) (as suggested by schnaader and Stephan Leclercq).

  2. Use delta-encoding on the timestamps (and maybe on the other values), as suggested by schnaader and ididak.

  3. Use protocol buffers to serialize (I'm going to use them anyway in the application, so that's not going to add dependencies or anything). Thanks to Marc Gravell for pointing me to it.
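
One way to express steps 1 and 3 together as a protobuf-net contract is with packed repeated fields (the type and member names here are my own, not from the answers):

using System.Collections.Generic;
using ProtoBuf;

// Columnar container for the rearranged data from step 1. Packed repeated
// fields store each column as one contiguous block without per-element
// tags: varints for the ints, fixed 8-byte values for the doubles.
[ProtoContract]
public class ReadingColumns
{
    // ZigZag keeps the occasional negative delta from blowing up to 10 bytes.
    [ProtoMember(1, IsPacked = true, DataFormat = DataFormat.ZigZag)]
    public List<int> TimestampDeltas { get; set; } = new List<int>();

    [ProtoMember(2, IsPacked = true)]
    public List<int> Ids { get; set; } = new List<int>();

    [ProtoMember(3, IsPacked = true)]
    public List<double> Values { get; set; } = new List<double>();
}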
