Implementing in-memory compression for objects in Java
We have this use case where we would like to compress and store objects (in-memory) and decompress them as and when required.
The data we want to compress is quite varied, from float vectors to strings to dates.
Can someone suggest a good compression technique for this?
We are looking at ease of compression and speed of decompression as the most important factors.
Thanks.
If you want to compress instances of `MyObject` you could have it implement `Serializable`, stream the objects into a compressed byte array, and later uncompress the `byte[]` back into the objects.
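The answer's code snippets were lost from the page; a minimal reconstruction of what they likely showed, using `GZIPOutputStream`/`GZIPInputStream` from `java.util.zip` (the `MyObject` class here is a stand-in for your own type):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class MyObject implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    MyObject(String name) { this.name = name; }
}

public class GzipSerialization {
    // Serialize the object, gzipping the resulting byte stream.
    static byte[] compress(Serializable obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(baos))) {
            oos.writeObject(obj);
        }
        return baos.toByteArray();
    }

    // Gunzip the bytes and deserialize the object back.
    static Object decompress(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(bytes)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] packed = compress(new MyObject("hello"));
        MyObject restored = (MyObject) decompress(packed);
        System.out.println(restored.name); // prints "hello"
    }
}
```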
Similar to previous answers, except I suggest you use DeflaterOutputStream and InflaterInputStream, as these are simpler/faster/smaller than the alternatives. The reason they are smaller is that they do just the compression, whereas the alternatives add file-format extensions such as CRC checks and headers.

If size is important, you might like to have a simple serialization of your own, because ObjectOutputStream has a significant overhead that makes small objects much larger. (It improves for larger objects, especially when compressed.)

E.g. an `Integer` takes 81 bytes, and compression won't help much for such a small number of bytes. It is possible to cut this significantly.

One proposal could be to use a combination of the following streams:
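The stream list itself is missing from the page; one plausible combination consistent with the suggestion (a sketch, not the original answer's code) chains `ByteArrayOutputStream`, `DeflaterOutputStream` and `ObjectOutputStream`:

```java
import java.io.*;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class DeflaterSerialization {
    // ByteArrayOutputStream <- DeflaterOutputStream <- ObjectOutputStream
    static byte[] compress(Serializable obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new DeflaterOutputStream(baos, new Deflater(Deflater.BEST_SPEED)))) {
            oos.writeObject(obj);
        }
        return baos.toByteArray();
    }

    // The mirror image: ByteArrayInputStream <- InflaterInputStream <- ObjectInputStream
    static Object decompress(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new InflaterInputStream(new ByteArrayInputStream(bytes)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String roundTripped = (String) decompress(compress("some varied data"));
        System.out.println(roundTripped); // prints "some varied data"
    }
}
```

`Deflater.BEST_SPEED` is one reasonable choice here given the question's emphasis on decompression speed; the default level works too.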
Compression of serialized objects in Java is usually not... well, not so good.

First of all, you need to understand that a Java object carries a lot of additional information you do not need. If you have millions of objects, you pay this overhead millions of times.

As an example, let's use a doubly linked list. Each element has a previous and a next pointer, plus you store a long value (timestamp), a byte for the kind of interaction and two integers for the user ids. Since we use pointer compression, we are at 6 bytes * 2 + 8 + 4 * 2 = 28 bytes. Java adds 8 bytes of header plus 12 bytes of padding. That makes 48 bytes per element.

Now we create 10 million lists with 20 elements each (a time series of users' click events over the last three years, in which we want to find patterns).

So we have 200 million elements * 48 bytes = roughly 10 GB of memory (ok, not that much).

Ok, leaving aside that garbage collection kills us and that the overhead inside the JDK skyrockets, we end up with 10 GB of memory.

Now let's use our own memory/object storage. We store everything as a column-wise data table where each object is actually a single row. So we have 200 million rows across timestamp, previous, next, userIdA and userIdB columns.

Previous and next now point to row ids and become 4 bytes (or 5 bytes if we exceed 4 billion entries, which is unlikely).

So we have 8 + 4 + 4 + 4 + 4 = 24 bytes * 200 million = 4.8 GB, and no GC problem.

Since the timestamp column stores the timestamps in a min/max fashion, and our timestamps all lie within three years, we need only 5 bytes to store each timestamp. Since the pointers are now stored relatively (+ and -), and since the click series are closely related in time, previous and next need only about 2 bytes each on average. For the user ids we use a dictionary: the click series cover roughly 500k users, so we need only three bytes per id.

So we now have 5 + 2 + 2 + 3 + 3 = 15 bytes * 200 million = 3 GB, plus a dictionary of 4 * 500k * 4 bytes = 8 MB. Sounds different from 10 GB, right?

But we are not finished yet. Since we now have no objects, only rows and data, we store each series as a table row and use special columns that are collections of arrays, each actually storing 5 values plus a pointer to the next five values and a pointer to the previous ones.

So we have 10 million lists with 20 entries each, and per list (adding some overhead for partly filled elements) we pay 20 * (5 + 3 + 3) + 5 * 6 = 20 * 11 + 5 * 6 = 250 bytes. That is 250 * 10 million = 2.5 GB, and we can access the arrays faster than walking the elements.

But hey, it's not over yet... The timestamps are now stored relatively, requiring only 3 bytes per entry plus 5 bytes for the first entry. So we save a lot more: 20 * 9 + 2 + 5 * 6 = 212 bytes per list, i.e. 212 * 10 million = 2.12 GB. Now store it all to memory using gzip and we end up at 1 GB, since we can lay it all out linearly, first storing the length of each array, then all the timestamps, then all the user ids, which makes it very likely that there are compressible patterns in the bits. Since we use a dictionary, we just sort it by the probability of each userId being part of a series.

And since everything is a table, you can deserialize it all at almost raw read speed, so 1 GB on a modern SSD costs 2 seconds to load. Try that with serialization/deserialization and you can hear your inner user cry.

So before you ever compress serialized data, store it in tables and check for each column/property whether it can be logically compressed. And finally, have fun with it.

And remember: 1 TB of (ECC) RAM costs 10k today. It's nothing. And a 1 TB SSD is 340 Euro. So do not waste your time on this issue unless you really have to.
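A toy sketch of the timestamp-column idea described above (the 3-byte fixed-width delta is an illustrative choice, not the author's actual layout; it assumes sorted timestamps with gaps under about 194 days in seconds):

```java
import java.nio.ByteBuffer;

public class ColumnStore {
    // Delta-encode a sorted timestamp column: one full 8-byte base value,
    // then a small fixed-width delta per following entry.
    static byte[] encodeTimestamps(long[] ts) {
        ByteBuffer buf = ByteBuffer.allocate(8 + (ts.length - 1) * 3);
        buf.putLong(ts[0]);
        for (int i = 1; i < ts.length; i++) {
            long delta = ts[i] - ts[i - 1]; // assumes sorted input, small gaps
            // 3 bytes cover deltas up to 2^24 seconds (~194 days)
            buf.put((byte) (delta >>> 16)).put((byte) (delta >>> 8)).put((byte) delta);
        }
        return buf.array();
    }

    static long[] decodeTimestamps(byte[] bytes, int n) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] ts = new long[n];
        ts[0] = buf.getLong();
        for (int i = 1; i < n; i++) {
            long delta = ((buf.get() & 0xFFL) << 16)
                       | ((buf.get() & 0xFFL) << 8)
                       |  (buf.get() & 0xFFL);
            ts[i] = ts[i - 1] + delta;
        }
        return ts;
    }

    public static void main(String[] args) {
        long[] clicks = {1_600_000_000L, 1_600_000_060L, 1_600_003_600L};
        byte[] packed = encodeTimestamps(clicks);
        System.out.println(packed.length);        // 14 bytes instead of 24
        long[] back = decodeTimestamps(packed, clicks.length);
        System.out.println(back[2] == clicks[2]); // true
    }
}
```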
The best compression technology I know is ZIP. Java supports zip streams. All you need to do is serialize your object into a byte array and then zip it.

Tip: use ByteArrayOutputStream, DataOutputStream, ZipOutputStream.
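A sketch along the lines of that tip, writing hand-picked fields through a `DataOutputStream` into a single `ZipOutputStream` entry (the field names and entry name are illustrative):

```java
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipPack {
    // Hand-rolled field serialization, zipped into one archive entry.
    static byte[] pack(long timestamp, String label) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(baos)) {
            zos.putNextEntry(new ZipEntry("obj"));
            DataOutputStream dos = new DataOutputStream(zos);
            dos.writeLong(timestamp);
            dos.writeUTF(label);
            dos.flush();
            zos.closeEntry();
        }
        return baos.toByteArray();
    }

    static Object[] unpack(byte[] bytes) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(bytes))) {
            zis.getNextEntry();
            DataInputStream dis = new DataInputStream(zis);
            return new Object[] { dis.readLong(), dis.readUTF() };
        }
    }

    public static void main(String[] args) throws Exception {
        Object[] fields = unpack(pack(1_600_000_000L, "click"));
        System.out.println(fields[0] + " " + fields[1]); // 1600000000 click
    }
}
```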
There are various compression algorithms implemented in the JDK. Check [java.util.zip](http://download.oracle.com/javase/6/docs/api/java/util/zip/package-summary.html) for all the algorithms implemented. However, it may not be a good idea to compress all your data. For instance, a serialized empty array may be several dozen bytes long, because the name of the underlying class is in the serialized data stream. Also, most compression algorithms are designed to remove redundancy from large data blocks; on small to medium Java objects you will probably see very little or no gain at all.
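A quick way to see that fixed overhead for yourself (the exact byte counts vary by class and JDK, so treat the printed numbers as indicative only):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPOutputStream;

public class OverheadDemo {
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(obj);
        }
        return baos.toByteArray();
    }

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
            gz.write(data);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] plain = serialize(new double[0]);
        byte[] packed = gzip(plain);
        // The serialized form of an empty array is dozens of bytes
        // (mostly the class name and stream metadata), and gzip may
        // even make such a tiny payload larger, not smaller.
        System.out.println(plain.length + " -> " + packed.length);
    }
}
```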
This is a tricky problem:

First, using `ObjectOutputStream` is probably not the answer. The stream format includes a lot of type-related metadata. If you are serializing small objects, the mandatory metadata will make it hard for the compression algorithm to "break even", even if you implement custom serialization methods.

Using `DataOutputStream` with minimal (or no) added type information will give a better result, but mixed data is not generally that compressible with a general-purpose compression algorithm.

For better compression, you may need to look at the properties of the data that you are compressing. For instance:

- `Date` objects could be represented as `int` values if you know they have a precision of 1 day.
- `int` values could be run-length encoded, or delta-encoded if they have the right properties.

However you do it, you will need to do a serious amount of work to get a worthwhile amount of compression. IMO, a better idea would be to write the objects to a database, datastore or file, and use caching to keep frequently used objects in memory.
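A sketch of those two ideas, representing day-precision dates as `int` epoch days and delta-encoding the resulting sequence (the helper names are made up for illustration):

```java
import java.time.LocalDate;

public class DateColumn {
    // A date with day precision fits in an int: days since the epoch.
    static int toEpochDay(LocalDate date) {
        return (int) date.toEpochDay();
    }

    // Delta-encode a sorted int sequence; small deltas compress well
    // and can be packed into fewer bytes downstream.
    static int[] deltaEncode(int[] values) {
        int[] deltas = new int[values.length];
        deltas[0] = values[0];
        for (int i = 1; i < values.length; i++) {
            deltas[i] = values[i] - values[i - 1];
        }
        return deltas;
    }

    public static void main(String[] args) {
        int[] days = {
            toEpochDay(LocalDate.of(2011, 1, 1)),
            toEpochDay(LocalDate.of(2011, 1, 2)),
            toEpochDay(LocalDate.of(2011, 1, 5)),
        };
        int[] deltas = deltaEncode(days);
        System.out.println(deltas[1] + " " + deltas[2]); // 1 3
    }
}
```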
If you need to compress arbitrary objects, a possible approach is to serialize the object into a byte array, and then use e.g. the DEFLATE algorithm (the one used by GZIP) to compress it. When you need the object, you can decompress and deserialize it. Not sure about how efficient this would be, but it will be completely general.
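For the DEFLATE step, a generic byte-array round-trip using the low-level `Deflater`/`Inflater` API from `java.util.zip` (pair it with whichever serialization approach you choose to get the byte array):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateBytes {
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "aaaaaaaaaabbbbbbbbbb".repeat(50).getBytes();
        byte[] packed = deflate(original);
        System.out.println(packed.length < original.length); // true
        System.out.println(new String(inflate(packed)).equals(new String(original))); // true
    }
}
```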