如何获得 bzip2 发现的模式? (或任何其他压缩算法)
我有一个由字符“0”、“1”、“2”、“3”组成的巨大文件。没有空格,也没有其他任何东西。就这4个字。我使用 bzip2 对其进行压缩,文件大小从 X 减小到 0.05*X。我想知道 bzip2 找到的用于实现文件压缩版本的字符串/模式是什么(例如 0123213232、0123121212222112 等)。有没有一种简单的方法可以从实际的 bz2 文件中提取该信息,或者通过使用某些特殊的命令行选项运行 bzip2 来提取该信息?
如果您知道其他一些现有压缩程序的答案,我也有兴趣了解一下。
感谢您的任何帮助。
最好的, 苏里卡托。
I have a huge file made up of the characters "0", "1", "2", "3". No spaces, not anything else. Just these 4 characters. I have used bzip2 to compress it and the file decreased size from X to 0.05*X. I'd like to know what were the strings/patterns that bzip2 found to achieve that compressed version of the file (e.g. 0123213232, 0123121212222112, etc.). Is there a simple way to extract that information either from the actual bz2 file or by running bzip2 with some special command-line option?
If you know the answer for some other existing compression program, I'll also be interested in hearing about that.
Thanks for any help.
Best,
Surikator.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
Bzip2 使用 Burrows-Wheeler 变换以可逆的方式将重复的字节序列转换为相同字节的序列。然后它使用 move-to-front 算法将重复字节转换为零序列。之后,它使用 huffmann 编码 将较短的符号分配给更频繁的字节(可能是零)。您可以在维基百科页面上找到更多详细信息。
Bzip2 uses Burrows-Wheeler transform to turn repeated byte sequences into sequences of the same byte in a reversible way. Then it uses the move-to-front algorithm to transform repeated bytes to zero sequences. After that it uses huffmann coding to assign shorter symbols to more frequent bytes (probably the zeros). You can find more details on the wikipedia page.
bzip2 没有这个选项,而且它的工作原理并不像我认为的那样。无论如何,您都可以找到算法中各个部分的代码。正如 @stribika 提到的,它使用 Burrows-Wheeler 并移动到前端算法,然后再通过霍夫曼编码器进行泵送。 Google 应该会以您选择的语言为您提供一些 Burrow's Wheeler 变换的结果。
但是,根据您正在寻找的内容,我认为您需要更多字典样式的编码器。您可能对 LZW 算法感兴趣:
http:// en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
它将建立一个像你展示的字符串字典。
bzip2 doesn't have an option for this, and it doesn't work exactly like I think you think it works. Regardless, you can find code for the various pieces in the algorithm. As @stribika mentioned, it uses the Burrows-Wheeler and move to front algorithms before pumping it through a Huffman encoder. Google should get you some results for a Burrow's Wheeler transform in the language of your choice.
However, based on what you're looking for, I think you want more of a dictionary style encoder. You might be interested in an LZW algorithm:
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
It will build up a dictionary of strings like you showed.
Burrows-Wheeler 变换
也称为块排序。如果您不喜欢阅读维基百科,请阅读 1999 年计算机科学数学基础:http://books.google.ee/books?id=OcJjpqAi15EC&pg=PA34&lpg=PA34&dq=mathematica+Burrows%E2%80 %93Wheeler+transform&source=bl&ots=KaOOIPJcKC&sig=5PzHG9UQeg3opr1FUMq8mPAxfn4&hl=et&ei=Y6vPTLfVFsqCOozvvPcE&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBMQ 6AEwAA#v=onepage&q& ;f=false
霍夫曼编码
对于输入:
“这是霍夫曼树的示例”
。构建这样的二叉树:然后用于构建编码表:
新的二进制文件只能被读取,如果你有相同的树,所以它也在输出中得到支持。还存储数据的长度,因为新二进制的总和不是完整的字节数。
打开软件
您可以阅读
Burrows-Wheeler transform
It is also called block-sorting. If you do not like reading Wikipedia, then read Mathematical foundations of computer science 1999: http://books.google.ee/books?id=OcJjpqAi15EC&pg=PA34&lpg=PA34&dq=mathematica+Burrows%E2%80%93Wheeler+transform&source=bl&ots=KaOOIPJcKC&sig=5PzHG9UQeg3opr1FUMq8mPAxfn4&hl=et&ei=Y6vPTLfVFsqCOozvvPcE&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBMQ6AEwAA#v=onepage&q&f=false
Huffman coding
For a input of:
"this is an example of a huffman tree"
. Binary tree like this is built:It is then used to build coding table:
New binary can only be read, if you have the same tree, so that is also backed in output. Also length of the data is store, because sum of new binary's is not full byte number.
Open software
You can just read