I have some extremely large arrays of integers which I would like to compress.
However, the way to do it in Java is to use something like this:
int[] myIntArray;
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(1024);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(new DeflaterOutputStream(byteArrayOutputStream));
objectOutputStream.writeObject(myIntArray);
Note that the int array first needs to be converted to bytes by Java.
I know that this is fast, but it still needs to create a whole new byte array, scan through the entire original int array, convert each value to bytes, and copy the result into the new byte array.
Is there any way to skip the byte conversion and compress the integers directly?
Skip the ObjectOutputStream and just store the ints directly as four bytes each. DataOutputStream.writeInt, for instance, is an easy way to do it.
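A rough sketch of the DataOutputStream.writeInt suggestion might look like the following. It writes the array length and then each int as four bytes into a DeflaterOutputStream, so only a small fixed-size buffer is used rather than a full byte copy of the array. The class name IntArrayDeflater, the length prefix, and the buffering layer are assumptions added for illustration and are not part of the answer.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: the name and the length-prefix layout are assumptions.
class IntArrayDeflater {

    // Writes the length, then each int as four big-endian bytes, through a deflater.
    static byte[] compress(int[] values) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new DeflaterOutputStream(sink)))) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        } // closing the stream flushes the buffer and finishes the deflater
        return sink.toByteArray();
    }

    // Reads back the length prefix and then that many ints.
    static int[] decompress(byte[] compressed) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(
                        new InflaterInputStream(new ByteArrayInputStream(compressed))))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }
}
```

Calling IntArrayDeflater.decompress(IntArrayDeflater.compress(myIntArray)) should return an array equal to the input. To send the data to a file or socket instead of memory, the ByteArrayOutputStream can be swapped for that stream.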
Hmm. A general-purpose compression algorithm won't necessarily do a good job compressing an array of binary values, unless there's a lot of redundancy. You might do better to develop something of your own, based on what you know about the data.
What is it that you're actually trying to compress?
You could use the representation used by Protocol Buffers. Each integer is represented by 1-5 bytes, depending on its magnitude.
Additionally, the new "packed" representation means you basically get a bit of "header" to say how big it is (and which field it's in) and then just the data. That's probably what ObjectOutputStream does as well, but it's a recent innovation in PB :)
Note that this will compress based on magnitude, not based on how often the integer has been seen. That will dramatically affect whether it's useful for you or not.
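To make the "1-5 bytes depending on magnitude" idea concrete, here is a hand-rolled sketch of base-128 varint encoding. It is not the Protocol Buffers library: the class VarintSketch and its methods are hypothetical, and it treats each int as an unsigned 32-bit value so that a negative number takes 5 bytes (an assumption made to keep the example small).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of varint encoding; not the protobuf API.
class VarintSketch {

    // Each output byte carries 7 payload bits; a set high bit means "more bytes follow".
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7; // unsigned shift, so negative values terminate after 5 bytes
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Decodes exactly 'count' varints from the start of 'bytes'.
    static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int result = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            values[i] = result;
        }
        return values;
    }
}
```

With this scheme values below 128 take one byte and values below 16384 take two, so the saving depends entirely on how small the numbers are, which is the point made above.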
A byte array is not going to save you much memory unless you make it a byte array holding unsigned ints, which is very dangerous in Java. It will replace memory overhead with extra processing time for the step checking of the code. This may be all right for data storage, but there are already data storage solutions out there.
Unless you are doing this for serialization purposes, I think that you are wasting your time.
In your example, you are writing the compressed stream to the ByteArrayOutputStream. Your compressed array needs to exist somewhere, and if the destination is memory, then ByteArrayOutputStream is your likely choice. You could also write the stream to a socket or file. In that case, you wouldn't duplicate the stream in memory. If your array is 800MB and you're running in 1GB of memory, you could easily write the array to a compressed file with the example you included. The change would be replacing the ByteArrayOutputStream with a file stream.
The ObjectOutputStream format is actually fairly efficient. It will not duplicate your array in memory, and has special code for efficiently writing arrays.
Are you wanting to work with the compressed array in memory? Would your data lend itself well to a sparse array? Sparse arrays are good when you have large gaps in your data.
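The file-stream variant described in this answer could be sketched roughly as follows: the same ObjectOutputStream and DeflaterOutputStream chain as in the question, with the ByteArrayOutputStream replaced by a file stream. The class name CompressedArrayFile and the added buffering are assumptions for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: streams the compressed array to disk instead of memory.
class CompressedArrayFile {

    static void write(int[] values, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new DeflaterOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(file))))) {
            out.writeObject(values);
        } // closing finishes the deflater and flushes everything to the file
    }

    static int[] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(
                        new BufferedInputStream(Files.newInputStream(file))))) {
            return (int[]) in.readObject();
        }
    }
}
```

Because the compressed bytes go straight to the file, the only large object held in memory is the original int array.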
If the array of ints is guaranteed to have no duplicates, you can use a java.util.BitSet instead.
Since its underlying implementation is an array of bits, with each bit indicating whether a certain integer is present in the BitSet, its memory usage is quite low, so it needs less space when serialized.
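A minimal sketch of the BitSet approach is shown below, with a hypothetical class name BitSetSketch. Two caveats apply beyond the no-duplicates requirement: BitSet only accepts non-negative indices, and it does not remember the original order, so the values come back sorted. Its size also grows with the largest value stored rather than with the number of values.

```java
import java.util.BitSet;

// Hypothetical illustration: stores a set of distinct, non-negative ints as bits.
class BitSetSketch {

    static BitSet toBitSet(int[] values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v); // throws IndexOutOfBoundsException if v is negative
        }
        return bits;
    }

    // Returns the stored values in ascending order; the input order is lost.
    static int[] toSortedArray(BitSet bits) {
        return bits.stream().toArray();
    }
}
```

BitSet implements Serializable, so the result can be passed to ObjectOutputStream.writeObject in the same way as the int array in the question.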