Determining the best compression algorithm for a series of bytes
For a personal project of mine, I'm writing up a small class to compress to and decompress from a rather obscure format. I've got the full spec, but that's not where the problem is.
First, this 'format' uses a set of 6 different compression types as well as uncompressed blocks of byte data. The formats are RLE, an offshoot of RLE where the number increments each byte (e.g. 3, 4, 5, ...), a 16-bit RLE, LZ-Copy, a reverse LZ-copy, and LZ-Copy Xor'd with 255. It's not the cleanest of specs, but I didn't design it either.
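To make the incrementing-RLE variant concrete, here is a toy decoder for it, assuming a simple (count, start byte) encoding; the real spec's wire format is not shown in the question and will differ.

```python
def decode_incrementing_rle(count, start):
    """Expand (count, start) into start, start+1, start+2, ... (mod 256)."""
    return bytes((start + k) % 256 for k in range(count))

print(decode_incrementing_rle(4, 3))   # the 3, 4, 5, ... pattern from the spec
```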
My compression routine is supposed to take in an array of anywhere from 1 to 65535 bytes and (hopefully) compress it as much as possible. My previous attempt simply calculated, starting from the current index in the uncompressed stream, which of the compression techniques above would provide the best compression, compressed however many bytes that method covered into the array of compressed bytes, and then repeated from the new 'uncompressed' index, e.g.:
{0,0,0,1,2,3,4}
The algorithm would first read that there were three zeroes at the start, and then output the RLE encoding for them that the spec used, and then starting from the fourth element, would read that incrementing RLE would cover the '1,2,3,4' well enough and compress that before returning.
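The greedy loop described above can be sketched roughly as follows. The opcodes, headers, and the restriction to just three methods are all hypothetical stand-ins; the real spec's six encodings and their formats are not reproduced here.

```python
def run_length(data, i):
    """Length of the run of identical bytes starting at index i."""
    j = i
    while j < len(data) and data[j] == data[i]:
        j += 1
    return j - i

def incrementing_run_length(data, i):
    """Length of the run where each byte is the previous byte + 1 (mod 256)."""
    j = i + 1
    while j < len(data) and data[j] == (data[j - 1] + 1) % 256:
        j += 1
    return j - i

def compress_greedy(data):
    """At each index, pick whichever method covers the most bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        rle = run_length(data, i)
        inc = incrementing_run_length(data, i)
        if rle >= inc and rle >= 3:
            out += bytes([0x00, rle, data[i]])   # hypothetical RLE opcode
            i += rle
        elif inc >= 3:
            out += bytes([0x01, inc, data[i]])   # hypothetical incrementing-RLE opcode
            i += inc
        else:
            out += bytes([0x02, 1, data[i]])     # hypothetical literal opcode
            i += 1
    return bytes(out)

# The example from the question: three zeroes, then an incrementing run.
print(compress_greedy(bytes([0, 0, 0, 1, 2, 3, 4])))
```

Note that each length scan here is linear; the LZ-copy variants are what make a naive search slow, since finding the best match at every index is quadratic (or worse) without a hash table or suffix structure over earlier positions.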
The problem, summarized, is that while trying to find the best method to use, the routine is very slow even on small (20-30 byte) arrays. Can anyone offer tips on how I might optimize this, or tell me if there's any more information I could provide to help?
1 Answer
It sounds like what you're trying to do is work out a large number of compression possibilities for every possible segment (let's call your variable-length 1-64K blocks "segments") of the file. Correct me if I'm wrong, but are you working out the best compression for the first segment by trying every method (with method 0 being uncompressed) at every possible segment length?
That's going to take a huge amount of time (roughly 420,000 compression attempts per segment). If that is what you're doing, you'll be better off choosing a single segment size (e.g., 64K) and applying each of the seven compression methods to it to choose the best. Then, for each segment, output the "method" byte followed by the compressed data.
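The suggestion above, run every candidate method over the whole segment once, keep the smallest result, and prefix a method byte, might look like the sketch below. The two compressors are illustrative stand-ins, not the spec's real encodings, and a full implementation would list all seven methods.

```python
def identity(seg):
    """Method 0: store the segment uncompressed."""
    return bytes(seg)

def simple_rle(seg):
    """Stand-in RLE: (count, byte) pairs, count capped at 255."""
    out = bytearray()
    i = 0
    while i < len(seg):
        j = i
        while j < len(seg) and seg[j] == seg[i] and j - i < 255:
            j += 1
        out += bytes([j - i, seg[i]])
        i = j
    return bytes(out)

METHODS = [identity, simple_rle]   # the real format would have seven entries

def compress_segment(seg):
    """Try each method once on the whole segment; emit method byte + data."""
    best_method, best_out = 0, METHODS[0](seg)
    for m, fn in enumerate(METHODS[1:], start=1):
        out = fn(seg)
        if len(out) < len(best_out):
            best_method, best_out = m, out
    return bytes([best_method]) + best_out
```

This reduces the work per segment to one pass per method, at the cost of never mixing methods within a segment; a middle ground is to keep the asker's greedy per-index choice but cap or memoize the match searches.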