Best compression algorithm? (See below for definition of "best")

Posted 2024-07-08 19:35:54

I want to store a list of the following tuples in a compressed format, and I was wondering which algorithm gives me:

  • smallest compressed size
  • fastest de/compression
  • tradeoff optimum ("knee" of the tradeoff curve)

My data looks like this:

(<int>, <int>, <double>), 
(<int>, <int>, <double>), 
...
(<int>, <int>, <double>)

One of the two ints refers to a point in time, and it's very likely that the numbers ending up in one list are close to each other. The other int represents an abstract id; those values are less likely to be close, although they aren't going to be completely random, either. The double represents a sensor reading, and while there is some correlation between the values, it's probably not of much use.

Comments (6)

梦里寻她 2024-07-15 19:35:54

Since the "time" ints can be close to each other, try to only store the first and after that save the difference to the int before (delta-coding). You can try the same for the second int, too.

Another thing you can try is to reorganize the data from [int1, int2, double], [int1, int2, double]... to [int1, int1...], [int2, int2...], [double, double...].

To find out what compression range your result will be in, you can write your data to a file and run it through Christian Martelock's compressor CCM. I found that it performs very well for this kind of data collection. It uses a fairly fast context-mixing algorithm. You can also compare it against other compressors like WinZip, or use a compression library like zlib, to see if the effort is worth it.
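
A minimal C# sketch of both ideas, assuming the first int is the timestamp (the tuple layout and the names below are made up for illustration):

using System.Collections.Generic;

static class ColumnarDelta
{
    // Split the rows into three column arrays and delta-code the time
    // column: the first time value is stored as-is, every later one as
    // the difference to its predecessor.
    public static (int[] TimeDeltas, int[] Ids, double[] Values)
        Encode(IReadOnlyList<(int Time, int Id, double Value)> rows)
    {
        var timeDeltas = new int[rows.Count];
        var ids = new int[rows.Count];
        var values = new double[rows.Count];
        int previousTime = 0;

        for (int i = 0; i < rows.Count; i++)
        {
            timeDeltas[i] = rows[i].Time - previousTime;
            previousTime = rows[i].Time;
            ids[i] = rows[i].Id;
            values[i] = rows[i].Value;
        }
        return (timeDeltas, ids, values);
    }
}

The three arrays can then be fed to a general-purpose compressor one after the other, so that similar values sit next to each other.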

断舍离 2024-07-15 19:35:54

Here is a common scheme used in most search engines: store deltas of values and encode the delta using a variable byte encoding scheme, i.e. if the delta is less than 128, it can be encoded with only 1 byte. See vint in Lucene and Protocol buffers for details.

This will not give you the best compression ratio, but it is usually the fastest at encoding/decoding throughput.
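
A sketch of that scheme in C#: this is the little-endian base-128 variant used by Protocol Buffers, and it assumes non-negative values (negative deltas would typically be zigzag-encoded first):

using System.IO;

static class VInt
{
    // Write a non-negative int using 7 data bits per byte; the high bit
    // signals that more bytes follow. Values below 128 take one byte.
    public static void Write(Stream s, int value)
    {
        uint v = (uint)value;
        while (v >= 0x80)
        {
            s.WriteByte((byte)(v | 0x80));
            v >>= 7;
        }
        s.WriteByte((byte)v);
    }

    // Read a value written by Write. No end-of-stream handling here;
    // this is a sketch, not a robust decoder.
    public static int Read(Stream s)
    {
        int result = 0, shift = 0, b;
        do
        {
            b = s.ReadByte();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}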

妄司 2024-07-15 19:35:54

If I'm reading the question correctly, you simply want to store the data efficiently. Obvious options like compressed xml are simple, but there are more direct binary serialization methods. One that leaps to mind is Google's protocol buffers.

For example, in C# with protobuf-net, you can simply create a class to hold the data:

[ProtoContract]
public class Foo {
    [ProtoMember(1)]
    public int Value1 {get;set;}
    [ProtoMember(2)]
    public int Value2 {get;set;}
    [ProtoMember(3)]
    public double Value3 {get;set;}
}

Then just [de]serialize a List<Foo> or Foo[] etc. via the ProtoBuf.Serializer class.

I'm not claiming it will be quite as space-efficient as rolling your own, but it'll be pretty darned close. The protocol buffer spec makes fairly good use of space (for example, using base-128 for integers, such that small numbers take less space). But it would be simple to try it out, without having to write all the serialization code yourself.

This approach, as well as being simple to implement, also has the advantage of being simple to use from other architectures, since there are protocol buffers implementations for various languages. It also uses much less CPU than regular [de]compression (GZip/DEFLATE/etc), and/or xml-based serialization.
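
For example, a sketch of the round trip with protobuf-net's static Serializer class (the file name is arbitrary):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

class Demo
{
    static void Main()
    {
        var readings = new List<Foo>
        {
            new Foo { Value1 = 1215541200, Value2 = 42, Value3 = 21.5 }
        };

        // Serialize to a file...
        using (var file = File.Create("readings.bin"))
        {
            Serializer.Serialize(file, readings);
        }

        // ...and read it back.
        List<Foo> roundTripped;
        using (var file = File.OpenRead("readings.bin"))
        {
            roundTripped = Serializer.Deserialize<List<Foo>>(file);
        }
    }
}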

春风十里 2024-07-15 19:35:54

Sort as already proposed, then store the

(first ints)
(second ints)
(doubles)

as a transposed matrix, and compress the result.
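
A sketch of what that could look like in C#, with DEFLATE from System.IO.Compression applied on top (the framing, a single shared length prefix, is my own choice):

using System.IO;
using System.IO.Compression;

static class TransposedWriter
{
    // Write the three sorted columns back to back through one DEFLATE
    // stream, so that similar values end up next to each other. All three
    // columns share one length, since each row contributes one entry.
    public static void Write(string path, int[] firstInts, int[] secondInts, double[] doubles)
    {
        using (var file = File.Create(path))
        using (var deflate = new DeflateStream(file, CompressionLevel.Optimal))
        using (var writer = new BinaryWriter(deflate))
        {
            writer.Write(firstInts.Length);
            foreach (var v in firstInts) writer.Write(v);
            foreach (var v in secondInts) writer.Write(v);
            foreach (var v in doubles) writer.Write(v);
        }
    }
}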

月野兔 2024-07-15 19:35:54

Most compression algorithms will work equally badly on such data. However, there are a few things ("preprocessing") that you can do to increase the compressibility of the data before feeding it to a gzip- or deflate-like algorithm. Try the following:

First, if possible, sort the tuples in ascending order. Use the abstract ID first, then the timestamp. Assuming you have many readings from the same sensor, similar ids will be placed close together.

Next, if the measurements are taken at regular intervals, replace the timestamp with the difference from the previous timestamp (except for the very first tuple of a sensor, of course). For example, if all measurements are taken at 5-minute intervals, the delta between two timestamps will usually be close to 300 seconds. The timestamp field will therefore be much more compressible, because most values are equal.

Then, assuming that the measured values are stable over time, replace all readings with a delta from the previous reading of the same sensor. Again, most values will be close to zero and thus more compressible.

Also, floating-point values are very bad candidates for compression because of their internal representation. Try converting them to integers. For example, temperature readings most likely do not require more than two decimal digits: multiply the values by 100 and round to the nearest integer.
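
Putting the four steps together, a sketch in C# (the field names and the two-decimal scale factor of 100 are assumptions taken from the example above):

using System;
using System.Collections.Generic;
using System.Linq;

static class Preprocessor
{
    // Sort by sensor id, then by timestamp; quantize each reading to two
    // decimals; delta-code timestamps and readings per sensor. The first
    // tuple of each sensor keeps its absolute values (delta to zero).
    public static List<(int Id, int TimeDelta, int ReadingDelta)> Run(
        IEnumerable<(int Id, int Timestamp, double Value)> rows)
    {
        var result = new List<(int Id, int TimeDelta, int ReadingDelta)>();
        foreach (var sensor in rows.GroupBy(r => r.Id).OrderBy(g => g.Key))
        {
            int prevTime = 0, prevReading = 0;
            foreach (var r in sensor.OrderBy(r => r.Timestamp))
            {
                int reading = (int)Math.Round(r.Value * 100);
                result.Add((sensor.Key, r.Timestamp - prevTime, reading - prevReading));
                prevTime = r.Timestamp;
                prevReading = reading;
            }
        }
        return result;
    }
}

The output tuples are dominated by small numbers, which both general-purpose compressors and variable-byte codes handle well.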

夏天碎花小短裙 2024-07-15 19:35:54

Great answers! For the record, I'm going to merge the ones I upvoted into the approach I'm finally using:

  1. Sort and reorganize the data so that similar numbers are next to each other, i.e. sort by id first, then by timestamp, and rearrange from (<int1>, <int2>, <double>), ... to ([<int1>, <int1>, ...], [<int2>, <int2>, ...], [<double>, <double>, ...]) (as suggested by schnaader and Stephan Leclercq).

  2. Use delta-encoding on the timestamps (and maybe on the other values), as suggested by schnaader and ididak.

  3. Use protocol buffers to serialize (I'm going to use them anyway in the application, so that's not going to add dependencies or anything). Thanks to Marc Gravell for pointing me to it.
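
One way to express steps 1 and 3 together as a protobuf-net contract is with packed repeated fields (the type and member names here are my own, not from the answers):

using System.Collections.Generic;
using ProtoBuf;

// Columnar container for the rearranged data from step 1. Packed repeated
// fields store each column as one contiguous block without per-element
// tags: varints for the ints, fixed 8-byte values for the doubles.
[ProtoContract]
public class ReadingColumns
{
    // ZigZag keeps the occasional negative delta from blowing up to 10 bytes.
    [ProtoMember(1, IsPacked = true, DataFormat = DataFormat.ZigZag)]
    public List<int> TimestampDeltas { get; set; } = new List<int>();

    [ProtoMember(2, IsPacked = true)]
    public List<int> Ids { get; set; } = new List<int>();

    [ProtoMember(3, IsPacked = true)]
    public List<double> Values { get; set; } = new List<double>();
}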
