Compressing a small amount of data
I have a program where I generate bitstreams, of about 80 to 150 bits or so, which I would like to compress, because I'm going to turn them into some kind of ASCII string so people can transmit them around.
Does anyone know of a good, free bit-aware compressor that might work on such a stream? My main problem with the "standard options" is this stream should really be treated as bits, not bytes, else the structure is lost, and their overhead swamps any gain.
Addition:
The reason I want to compress these streams is because users are going to be cutting+pasting them, probably using something like base64 encoding, so saving some data is helpful.
Here is an example, for those who would like to see it. I'll add formatting to make it easier to read:
110 110 - This is a 6x6 grid (the maximum is 7x7, so we only need 3 bits!)
000000
011110
010010
010010
011110
000000 - This is one layout grid
000000
000000
001000
000100
000000
000000 - This is the second layout grid
Now we list some pieces
010 11111111 - A piece is a 3-bit colour code, then an 8-bit list of 'on / off' bits.
001 10101010 - Another bit!
001 10101010 - Another, identical bit!
The reason I say this should be considered 'as bits' is that there are obvious compression options when it is viewed as a bitstream (in particular, there are usually many 0s in the grids), and these disappear when you consider it as a byte stream.
Comments (12)
What are you hoping to accomplish by compressing 150 bits? Unless you aggregate several of these 19-byte messages, I'm not sure what you hope to gain. Is it a UI issue, where you want your users to send and receive "codes"?
How about base 64 encoding? This will take binary data and turn it into coded characters for easy transmission or entry.
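As a rough illustration of the base64 suggestion, here is a minimal Python sketch that packs a string of '0'/'1' characters into bytes and base64-encodes it. The function names, the one-byte padding prefix, and the choice of the URL-safe alphabet are assumptions made for this example, not part of the original question.

```python
import base64

# The sample stream from the question: size header, two grids, three pieces.
SAMPLE_BITS = (
    "110110"
    "000000011110010010010010011110000000"
    "000000000000001000000100000000000000"
    "01011111111" "00110101010" "00110101010"
)

def bits_to_base64(bits: str) -> str:
    """Pack '0'/'1' characters into bytes, then base64-encode them.

    A one-byte prefix records how many padding bits were appended,
    so that decoding can restore the exact bit count.
    """
    pad = (-len(bits)) % 8
    padded = bits + "0" * pad
    raw = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    encoded = base64.urlsafe_b64encode(bytes([pad]) + raw).decode("ascii")
    return encoded.rstrip("=")  # the '=' padding can be recomputed when decoding

def base64_to_bits(text: str) -> str:
    """Inverse of bits_to_base64."""
    data = base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))
    pad, raw = data[0], data[1:]
    bits = "".join(f"{byte:08b}" for byte in raw)
    return bits[: len(bits) - pad]

if __name__ == "__main__":
    code = bits_to_base64(SAMPLE_BITS)
    print(len(SAMPLE_BITS), "bits ->", len(code), "characters:", code)
```

With this layout the 111-bit sample becomes 15 bytes (one prefix byte plus 14 data bytes), which base64 turns into 20 characters. Any compression step would have to beat that figure to be worth adding.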
Chris, thanks for posting those samples. I think run-length encoding is the way you want to go. That should be pretty trivial to implement.
http://en.wikipedia.org/wiki/Run-length_encoding
Will work well with all those consecutive 0's.
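To make the run-length idea concrete at the bit level, here is a small Python sketch. It stores the first bit, then the length of each run as an Elias gamma code. The gamma coding and the function names are choices made for this example; the answer above does not prescribe a particular way of storing run lengths.

```python
from itertools import groupby

def gamma(n: int) -> str:
    """Elias gamma code for n >= 1: (len - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def rle_encode(bits: str) -> str:
    """First bit of the stream, followed by gamma-coded run lengths."""
    if not bits:
        return ""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0] + "".join(gamma(run) for run in runs)

def rle_decode(code: str) -> str:
    """Inverse of rle_encode (expects well-formed input)."""
    if not code:
        return ""
    bit, out, i = code[0], [], 1
    while i < len(code):
        zeros = 0
        while code[i] == "0":          # count the gamma prefix
            zeros += 1
            i += 1
        run = int(code[i : i + zeros + 1], 2)
        i += zeros + 1
        out.append(bit * run)
        bit = "1" if bit == "0" else "0"  # runs alternate between 0 and 1
    return "".join(out)

# The two layout grids from the question, row by row.
GRID1 = "000000011110010010010010011110000000"
GRID2 = "000000000000001000000100000000000000"

if __name__ == "__main__":
    for name, grid in (("grid 1", GRID1), ("grid 2", GRID2)):
        print(name, len(grid), "bits ->", len(rle_encode(grid)), "bits")
```

On the sample data the sparse second grid shrinks from 36 to 22 bits, but the ring-shaped first grid grows to 40 bits because it contains many short runs. Plain RLE therefore only pays off when the grids are mostly empty.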
So the primary reason to compress these strings is to make them easier to cut and paste? Makes sense; that sounds like an interesting project.
If you are just trying to make the strings more human-manageable it sounds like you're all set. If you are trying to compress them so that they transmit faster over the wire I think the benefit of compressing small strings may be defeated by other TCP issues like MTU sizes and all that. (I'm not experienced there, so take that last bit with many grains of salt)
Good luck!
My first suggestion is that you look into range encoding. Instead of
1: compressing from bit data into binary data and then
2: encoding binary data into base64 ASCII data,
you could pack your bits directly into the range 0-N (where N is the number of printable characters you are using minus 1) and then do a dead-easy mapping.
My second suggestion is that you study the filter methods employed by PNG and think about whether similar methods could be used to render your data more compressible. It's difficult to tell from just two sample layout grids, but it seems very likely from your first grid that some method such as "predict each pixel based on its neighbors above and to the left, and then convert each pixel to 0 if it met its prediction and 1 if it defied its prediction" could give you a much more uniform set of data, and thus greater compression.
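Both suggestions can be sketched in a few lines of Python. The first part below is only the "pack directly into 0-N" step, done as a big-integer base conversion; it is not a full range coder. The 94-character alphabet, the leading sentinel bit, and the majority-vote predictor are all assumptions chosen for illustration.

```python
import string

# Hypothetical alphabet: all 94 printable, non-space ASCII characters.
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def pack_bits(bits: str, alphabet: str = ALPHABET) -> str:
    """Map a bit string straight onto the alphabet via base conversion.

    A leading '1' sentinel preserves leading zeros and the exact length.
    """
    value = int("1" + bits, 2)
    base = len(alphabet)
    chars = []
    while value:
        value, digit = divmod(value, base)
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

def unpack_bits(text: str, alphabet: str = ALPHABET) -> str:
    """Inverse of pack_bits."""
    value = 0
    for ch in text:
        value = value * len(alphabet) + alphabet.index(ch)
    return bin(value)[3:]  # drop the '0b' prefix and the sentinel bit

def predict_filter(grid_bits: str, width: int, decode: bool = False) -> str:
    """XOR each cell with a prediction from its left, upper and upper-left
    neighbours (majority vote; cells outside the grid count as 0).

    With decode=False the input is pixels and the output is residuals;
    with decode=True the input is residuals and the output is pixels.
    """
    pixels, result = [], []
    for i, ch in enumerate(grid_bits):
        row, col = divmod(i, width)
        left = pixels[i - 1] if col else 0
        up = pixels[i - width] if row else 0
        upleft = pixels[i - width - 1] if row and col else 0
        prediction = 1 if left + up + upleft >= 2 else 0
        bit = int(ch) ^ prediction
        pixels.append(bit if decode else int(ch))  # always the true pixel
        result.append(str(bit))
    return "".join(result)

GRID1 = "000000011110010010010010011110000000"  # first 6x6 grid from the question

if __name__ == "__main__":
    residual = predict_filter(GRID1, 6)
    print("residual:", residual)
    print("packed  :", pack_bits(residual))
```

Whether the filtered grid actually compresses better depends on how typical grids look, so the predictor would need tuning against real data. The base conversion alone already saves a few characters over base64, at the cost of using characters that may be awkward to paste in some contexts.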
I'd guess that no general purpose algorithm will give you great compression for this kind of data.
Your best bet is to analyze the structure of your data and try to find a custom compression algorithm or possibly customize an existing one (maybe with a pre-filled dictionary or something like that).
Since the streams are so small, can you post some of them here?
Also are you sure that there is enough redundancy in those streams to even allow compression? Are there any repeating blocks of data?
It's kind of a longshot, but in the absence of any concrete answers, you might want to look into the ROM scene and check out how strings of text were compressed in cartridge-based RPG games like "Chrono Trigger" or "Final Fantasy III." I know that the text strings were compressed in those games (bytes were so precious in those days) and unraveling the scheme proved a fun challenge for hackers. That's the only thing that came to my mind when you mentioned lots of short little strings being compressed.
Your root problem might remain, though. I would imagine that the compression schemes in those ROMs exploited redundancy across many strings (i.e., if "Timbuktu" occurred in 58 different strings) and not so much within a single stream.
CCITT's Group 3 and Group 4 lossless encoding schemes, used in compressing G3 and G4 TIFFs, were designed with binary data in mind. G4 TIFFs are black and white images usually used for OCR-ing and faxes. Another simple scheme that comes to mind is RLE.
I would suggest you look into using zlib. It is downloadable, and the license allows you to use it for pretty much any project. An important point is that it is widely used, and thus well debugged. If your data is important, you don't want to have to debug odd edge cases in a homebrew algorithm at random dates in the future.
I've messed around with it a bit, and it does allow a stream-oriented compression. I'm not real sure how good it is when you are just feeding it a small amount of data at a time though. Loss-less compression tends to work by finding and eliminating patterns, and there won't be a lot of patterns to find if you are feeding it something small like 12 bytes at a time.
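The concern about very small inputs is easy to check with Python's standard `zlib` module. The sketch below packs the sample stream from the question into bytes and compares sizes; the helper name `zlib_overhead` is made up for this example.

```python
import zlib

# The sample stream from the question: size header, two grids, three pieces.
SAMPLE_BITS = (
    "110110"
    "000000011110010010010010011110000000"
    "000000000000001000000100000000000000"
    "01011111111" "00110101010" "00110101010"
)

def bits_to_bytes(bits: str) -> bytes:
    """Pad the bit string with zeros to a whole number of bytes and pack it."""
    padded = bits + "0" * ((-len(bits)) % 8)
    return int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""

def zlib_overhead(data: bytes) -> int:
    """Compressed size minus original size (positive means the data grew)."""
    return len(zlib.compress(data, 9)) - len(data)

if __name__ == "__main__":
    sample = bits_to_bytes(SAMPLE_BITS)
    print("original bytes  :", len(sample))
    print("compressed bytes:", len(zlib.compress(sample, 9)))
    print("empty input     :", len(zlib.compress(b"", 9)), "bytes")
```

Even an empty input comes out as 8 bytes, because the zlib container adds a 2-byte header and a 4-byte checksum on top of the DEFLATE data. On a 14-byte message with no repeated byte sequences, that fixed cost makes the output larger than the input.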
I'm not voting Juan's answer up because he also suggests using GIF, which is a lossy compression. You didn't give a lot of info, but I'm guessing you don't want any compression format that actually loses data. Most popular graphic, audio, and video compression algorithms are lossy; they rely on the ability of human senses to take in an image or sound properly with some of the original information removed or modified slightly.
JBIG might give you what you need.
http://en.wikipedia.org/wiki/JBIG
http://www.jpeg.org/jbig/index.html
http://www.cl.cam.ac.uk/~mgk25/jbigkit/
JBIG is used to compress 1-bpp fax images.
The zlib compression (which uses DEFLATE, the same algorithm as gzip) is free. It has a few settings, but I am not sure how much you can save, unless there is some periodic pattern to your bits.
Since PNG and GIF graphics files are essentially representations of bit patterns, perhaps you can look at the compression algorithms they use.
What you want is lossless binary compression. I am sure there are papers or web articles, if not tons of other resources, out there. Google those terms and I suspect you will get what you need.
How much data are you talking about? Is your pipe small or the throughput so high that you have to compress?
In retrospect, your data is so small that you are probably not going to get worthwhile gains unless you analyze your traffic and do your own "compression" which is basically just a mapping/hash of known bit patterns.
As someone else said, post some sample data and there will probably be better advice after that.
I've had the same thought as Tim: such a small amount of data barely seems worth compressing. As a matter of fact, I'd suggest that what you really want to look into is some sort of ASCII encoding method, like uuencode or MIME encoding (aka "Base64").
Just to add to what's already been said, isn't "compressing a small amount of data" intrinsically a bit pointless? If you could elaborate on the data, the platform, or the uses, that might help.
As for the bits vs ASCII, I'm not entirely sure what you're getting at, but as mentioned by Michael, Base64 provides a way to make arbitrary binary more friendly.
Note that any conversion from binary into ASCII is the opposite of compression: Base64, for example, spends four characters on every three bytes.