Is there a good compression algorithm for small chunks of data? (around 2k in size)
I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary, so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
4 Answers
If you think that reducing your data packets to their entropy level is the best you can do, you can try simple Huffman compression.

For an early look at how well this would compress, you can pass a packet through Huff0:

http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html

It is a simple order-0 Huffman encoder, so the result will be representative.

For more specific ideas on how to exploit the characteristics of your data, it would help to describe what the packets contain and how they are generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then provide one or a few samples.
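To get an early feel for this without wiring in a real coder, here is a minimal Java sketch (illustrative only, not from any library) that estimates the order-0 entropy of a packet in bits per byte; a result close to 8.0 means an order-0 coder like Huff0 has almost nothing to gain:

```java
// Illustrative sketch: estimate the order-0 entropy of a packet,
// in bits per byte. 8.0 means incompressible for an order-0 coder.
public class EntropyEstimate {
    public static double bitsPerByte(byte[] packet) {
        int[] counts = new int[256];
        for (byte b : packet) {
            counts[b & 0xFF]++;
        }
        double entropy = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / packet.length;
            entropy -= p * (Math.log(p) / Math.log(2)); // Shannon term
        }
        return entropy;
    }

    public static void main(String[] args) {
        byte[] sample = "aaaabbbbccdd".getBytes();
        System.out.printf("~%.2f bits/byte%n", bitsPerByte(sample));
    }
}
```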
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
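As a concrete illustration, here is a minimal Elias gamma encoder sketch (illustrative only; a real implementation would pack bits into a byte buffer rather than build a string). Gamma writes a number's bit length in unary, followed by the number itself:

```java
// Minimal Elias gamma encoder sketch for positive ints (n >= 1).
public class EliasGamma {
    public static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("gamma needs n >= 1");
        int bits = 32 - Integer.numberOfLeadingZeros(n); // floor(log2 n) + 1
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits - 1; i++) sb.append('0'); // unary length prefix
        sb.append(Integer.toBinaryString(n));              // n itself, MSB first
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(1)); // "1"       (1 bit)
        System.out.println(encode(9)); // "0001001" (7 bits)
    }
}
```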
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
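A minimal sketch of that preprocessing step, assuming your arrays behave like a smooth series (the class and method names here are made up for illustration): delta-encode, then zigzag-map so that small signed deltas become small non-negative ints that a gamma-style code handles well (add 1 before gamma coding, since gamma cannot represent 0).

```java
// Sketch: delta + zigzag preprocessing for a time-series int array.
public class DeltaZigzag {
    public static int[] encode(int[] values) {
        int[] out = new int[values.length];
        int prev = 0;
        for (int i = 0; i < values.length; i++) {
            int delta = values[i] - prev;          // small if the series is smooth
            out[i] = (delta << 1) ^ (delta >> 31); // zigzag: 0,-1,1,-2,2 -> 0,1,2,3,4
            prev = values[i];
        }
        return out;
    }

    public static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        int prev = 0;
        for (int i = 0; i < encoded.length; i++) {
            int delta = (encoded[i] >>> 1) ^ -(encoded[i] & 1); // undo zigzag
            prev += delta;
            out[i] = prev;
        }
        return out;
    }
}
```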
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
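Here is a sketch of that flag idea using java.util.zip's Deflater (the one-byte flag format is just an assumption for illustration, not a fixed wire format): try to compress, and fall back to the raw bytes if the compressor expanded the data.

```java
import java.util.zip.Deflater;

// Sketch: compress only when it actually helps; otherwise send raw
// bytes behind a one-byte flag.
public class MaybeCompress {
    static final byte RAW = 0, COMPRESSED = 1;

    public static byte[] pack(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length]; // only worth it if smaller than raw
        int n = deflater.deflate(buf);
        boolean smaller = deflater.finished() && n < data.length;
        deflater.end();

        byte[] out = new byte[1 + (smaller ? n : data.length)];
        out[0] = smaller ? COMPRESSED : RAW;
        System.arraycopy(smaller ? buf : data, 0, out, 1, out.length - 1);
        return out;
    }
}
```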
Elias Gamma Coding might actually increase the size of your data.
You already have an upper bound on your numbers (whatever fits into a 4-byte or possibly 8-byte int/long). Elias gamma encodes the length of a number in unary, followed by the number itself, costing 2*floor(log2 n) + 1 bits per value, which is probably not what you want. If you get many small values, it can make things smaller; if you also get big values, it will probably increase the size (a value near the 8-byte unsigned maximum takes 127 bits instead of 64, almost twice as big).
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors. Though I'm not sure if the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
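A rough sketch of how that could look with java.util.zip (the dictionary contents below are a placeholder; a real dictionary would contain byte sequences that actually recur in your packets). Note that on the receiving side, Inflater.inflate() first returns 0 with needsDictionary() true, at which point you supply the same dictionary:

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: deflate with an agreed preset dictionary and the FILTERED
// strategy. Both sides must be built with the same DICT bytes.
public class DictCompress {
    static final byte[] DICT = "common byte patterns go here".getBytes(); // placeholder

    public static byte[] compress(byte[] data) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setStrategy(Deflater.FILTERED); // biased toward small values with some runs
        d.setDictionary(DICT);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length * 2 + 64]; // generous bound for a sketch
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    public static byte[] decompress(byte[] compressed, int originalLength) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[originalLength]; // assumes the receiver knows the size
        int n = inf.inflate(out);              // returns 0 until the dictionary is set
        if (inf.needsDictionary()) {
            inf.setDictionary(DICT);
            n = inf.inflate(out);
        }
        inf.end();
        return out;
    }
}
```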