Binary array compression in C
I have a binary array in C, and I want to compress the array; kindly suggest an algorithm that compresses a binary array. I have used the Lempel–Ziv–Welch (LZW) algorithm, but it is not suitable for me because there is no repetition in my data.
Comments (6)
Why not just use libz's deflate? As an added bonus, libz is available on pretty much every existing platform.

Or the newer LZMA? It beats bzip2 on binary data compression.
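A minimal sketch of the one-shot zlib API (compress() wraps deflate internally); this assumes zlib is installed and the program is linked with -lz:

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void) {
        /* Toy input buffer; substitute your own binary data. */
        unsigned char src[] = {0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC};
        uLong srcLen = (uLong)sizeof(src);

        /* compressBound() gives the worst-case compressed size. */
        uLongf dstLen = compressBound(srcLen);
        unsigned char *dst = malloc(dstLen);
        if (dst == NULL)
            return 1;

        /* One-shot deflate; on success dstLen holds the actual size. */
        if (compress(dst, &dstLen, src, srcLen) != Z_OK) {
            free(dst);
            return 1;
        }
        printf("compressed %lu bytes to %lu bytes\n",
               (unsigned long)srcLen, (unsigned long)dstLen);
        free(dst);
        return 0;
    }

Note that on tiny or truly random buffers the "compressed" output will usually come out larger than the input, which is exactly the pigeonhole point raised below.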
You may have no repetition, but there could still be a pattern in the data which could be taken advantage of. This requires knowing more about the data than just that there is no repetition, though.

If your data is actually (or nearly) randomly distributed, then compressing it runs into the pigeonhole problem. This states that if you have X pigeons and only Y holes to put them in, and X > Y, then you don't have enough room. In compression, this means you cannot take advantage of the trick of not storing some pigeons because they are identical twins of one already in a hole, leaving just a note telling the decompression algorithm to clone that pigeon. In Huffman coding, all pigeons are clones of pigeons in the pigeon library. In several other compression schemes, some pigeons may be mega-pigeons made up of other pigeons.
You can cut the space in half easily!

Since your binary data has NO repetition, your only options are [0, 1] and [1, 0]. Anything more would repeat either a zero or a one. Therefore, you can represent the first sequence with a 0 and the second with a 1. Encoding would look something like this...
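(A plausible Haskell reconstruction of what the answer describes; this is an assumption, not the author's original snippet:)

    -- Map each of the two possible repetition-free arrays to one bit.
    encode :: [Int] -> Int
    encode [0, 1] = 0
    encode [1, 0] = 1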
And decoding would be...
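(Again a reconstruction under the same assumption:)

    -- Invert the encoding: recover the two-element array from one bit.
    decode :: Int -> [Int]
    decode 0 = [0, 1]
    decode 1 = [1, 0]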
Sorry for the Haskell syntax; it's just so much more readable in this case. This turns your two-element array into a one-element array, which can be stored in half the space! Magic.

EDIT: This ignores the trivial cases of [0] and [1]. If those need to be handled (although you shouldn't really be compressing a single bit), it is impossible to get a compression ratio better than 100%.
If you have binary data, you will most likely treat it as something like a char[]. In your question and comments you state that there is (almost) no repetition, which is only possible if you have no more than 256 (char) data items. But I guess you have more data, so compression is possible. If the frequency of your data items is not evenly distributed, you may have some luck with a simple Huffman coding.

To give you more precise advice, we need more details about the kind of data you want to compress.
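One way to check whether Huffman coding could pay off is to measure how skewed the byte-frequency distribution is. A small sketch (the function name is hypothetical) estimating the Shannon entropy in bits per byte; compile with -lm. Values near 8 mean a nearly uniform distribution with little to gain, while values well below 8 suggest Huffman coding could help:

    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Estimate the Shannon entropy of a buffer in bits per byte. */
    double entropy_bits_per_byte(const unsigned char *buf, size_t len) {
        size_t freq[256] = {0};
        for (size_t i = 0; i < len; i++)
            freq[buf[i]]++;

        double h = 0.0;
        for (int b = 0; b < 256; b++) {
            if (freq[b] == 0)
                continue;
            double p = (double)freq[b] / (double)len;
            h -= p * log2(p);   /* contribution of symbol b */
        }
        return h;
    }

    int main(void) {
        unsigned char sample[] = "aaaaabbbcc";   /* skewed toy data */
        printf("%.3f bits/byte\n",
               entropy_bits_per_byte(sample, sizeof(sample) - 1));
        return 0;
    }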
Alternatively: your binary data represents certain values. You could reduce the bit count of all values. You need to know the possible range, and write and read the data bitwise. This can save a lot of space if, for example, you store values in a uint32 that only need a few bits.
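A minimal sketch of that bitwise packing, assuming every value is known to fit in 5 bits (put_bits is a hypothetical helper, not a library function):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Append the low `bits` bits of `value` to `buf` at bit offset *pos. */
    static void put_bits(uint8_t *buf, size_t *pos, uint32_t value, unsigned bits) {
        for (unsigned i = 0; i < bits; i++, (*pos)++) {
            if ((value >> i) & 1u)
                buf[*pos / 8] |= (uint8_t)(1u << (*pos % 8));
        }
    }

    int main(void) {
        /* Values in the range 0..31 fit in 5 bits each, instead of
         * the 32 bits a uint32_t would occupy. */
        uint32_t values[] = {3, 17, 30, 8};
        uint8_t packed[16] = {0};

        size_t bitpos = 0;
        for (size_t i = 0; i < 4; i++)
            put_bits(packed, &bitpos, values[i], 5);

        printf("packed 4 values into %zu bytes\n", (bitpos + 7) / 8);
        return 0;
    }

Reading the values back requires a matching get_bits that walks the same bit offsets, and both sides must agree on the fixed bit width.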
Compression is not magic. If your data is completely random, there is no compression algorithm available that can make it smaller.

Most data is not completely random, but it is up to you to discover the optimum way to express it so that the patterns can be detected. Images and sound are common enough that standard algorithms have been developed for them, but no more can be said about your specific problem without many more details.