Optimizing Byte-Pair Encoding
Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
Comments (9)
This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
The links to the code on Dr. Dobb's site are broken, but that webpage mirrors them.
That code uses a hash table to track the digraphs used and their counts on each pass over the buffer, so as to avoid recomputing them from scratch each pass.
My test data is enwik8 from the Hutter Prize.
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold of 4 (the smallest count at which compressing a pair gains a byte); this list is sorted and the first substitution is made.
As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than always re-sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved data between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds of runtime.
Given that the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version, I think the hashtable is of arguable value, and that a list is a better structure for this problem.
It's still multi-pass though, and so it's not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses. There are of course occasions where it is suitable and used; Symbian uses it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.
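A minimal sketch of the buffer-swap idea mentioned above (the names `one_pass` and `compress_block` are illustrative, not the actual bpev3 code): instead of copying the partially compressed block back from a temporary buffer after each pass, just exchange the two pointers.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for one substitution pass of BPE: reads 'len' bytes from 'src',
   writes the transformed block to 'dst', and returns the new length.
   Here it just copies, so the loop below compiles and can be traced. */
static size_t one_pass(const unsigned char *src, size_t len, unsigned char *dst)
{
    memcpy(dst, src, len);
    return len;
}

/* The pointer-swap trick: keep two buffers and swap roles each pass
   instead of copying the result back into 'buf' every time. */
static size_t compress_block(unsigned char *buf, unsigned char *tmp,
                             size_t len, int passes)
{
    unsigned char *cur = buf, *next = tmp;
    while (passes-- > 0) {
        len = one_pass(cur, len, next);
        unsigned char *t = cur;   /* an O(1) swap replaces an O(n) memcpy */
        cur = next;
        next = t;
    }
    if (cur != buf)               /* caller expects the result in 'buf' */
        memcpy(buf, cur, len);
    return len;
}
```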
I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
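One way to read the table suggestion above (my interpretation; the answer's own code is not shown): keep a flat array of 65,536 pair counts indexed by the two bytes of each digraph. With 16-bit counters, which suffice because a block is at most 64 kB, the table occupies 128 kB.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: count every adjacent byte pair in a block with a flat
   65,536-entry table.  uint16_t counters are enough since no pair can
   occur more than 65,535 times in a <=64 kB block. */
static void count_pairs(const unsigned char *block, size_t len,
                        uint16_t counts[65536])
{
    memset(counts, 0, 65536 * sizeof counts[0]);
    for (size_t i = 0; i + 1 < len; i++)
        counts[(block[i] << 8) | block[i + 1]]++;
}
```

Finding the most frequent pair is then a single scan over the 65,536 counters, and the same table can be updated incrementally as substitutions are made.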
Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Enhanced Byte Pair Encoding
EBPE adds two post-processes to Byte Pair Encoding:
1. Compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
So "AA1" tells us when decoding that every time we see a "1" in the data file, we replace it with "AA".
While long runs of sequential tokens are possible, let's look at this 8-token example:
It is 24 bytes long (8 * 3).
The token 2 is not in the file, indicating that it was not an open token to use; or, to put it another way, the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential, so any time we see a sequential run of 4 tokens or more, let's modify our dictionary to be:
Where the <255> tells us that the tokens for the byte pairs are implied and are incremented by 1 more than the last token we saw (3). We increment by one until we see the next <255>, indicating an end of run.
I saved 175 bytes using this enhancement on a text file where tokens 128 to 254 would be in sequence, as well as others in general, including the run created by lowercase pre-processing.
2. Compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>), we scan the file and find a single "j" in the file. Let this char do double duty:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression.
If the j occurred 1 time in the data file, we would need to add 1 <255> and a 3-byte dictionary entry, so we need to save more than 4 bytes in BPE for this to be worth it.
If the j occurred 6 times, we would need 6 <255>s and a 3-byte dictionary entry, so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on whether further compression is possible and how many byte pairs remain in the file, this post-process has saved in excess of 100 bytes on test runs.
Note: When decompressing, make sure not to decompress every "j". One needs to look at the prior character to make sure it is not a <255> before decompressing. Finally, after all decompression is done, go ahead and remove the <255>'s to recreate your original file.
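A minimal sketch of that decoding check, under one reading of the scheme above (names and structure are mine, not the EBPE code): a "j" is expanded as a token only when it is not preceded by the <255> escape; escaped occurrences are emitted literally and the escape byte dropped.

```c
#include <stdio.h>

#define ESC  255   /* the <255> escape byte */
#define TOK  'j'   /* the rarely used character reused as a token (example) */

/* Sketch of the "don't decompress every j" rule:
   - ESC followed by any byte -> emit that byte literally, drop the ESC
   - TOK on its own           -> it is acting as a byte-pair token
   This only illustrates the escape handling, not full EBPE decoding. */
static void decode_escapes(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == ESC) {
            int next = fgetc(in);
            if (next == EOF) break;
            fputc(next, out);   /* escaped: a literal character */
        } else if (c == TOK) {
            /* unescaped: 'j' is a token; its byte pair would be looked up
               in the dictionary and emitted here (omitted in this sketch) */
        } else {
            fputc(c, out);
        }
    }
}
```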
3. What’s next in EBPE?
Unknown at this time
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, whether the new byte pair (after replacement) will also be a good candidate for replacement.
Here are my thoughts at first sight. Maybe you already do, or have already thought of, all this.
I would try the following.
Two adjustable parameters:
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how they influence compression ratio and speed. They should probably change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!
Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
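The break-even argument in that last parenthesis can be written down directly (a sketch under my reading of it, not SCZ's actual code): replacing each occurrence of the pair BP with the single byte M saves one byte per occurrence, while escaping each original M costs one extra byte, plus whatever fixed cost the mapping itself takes.

```c
/* Sketch of the space accounting described above (my reading, not SCZ code).
   Reusing byte M as a token for pair BP:
     - each of the count_bp occurrences of BP shrinks from 2 bytes to 1,
     - each of the count_m literal occurrences of M grows to 2 bytes (E + M),
     - 'overhead' is the assumed fixed cost of recording the mapping. */
static int reuse_is_worth_it(long count_bp, long count_m, long overhead)
{
    long saved = count_bp;           /* one byte saved per replaced pair    */
    long lost  = count_m + overhead; /* one byte lost per escaped literal M */
    return saved > lost;
}
```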
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8
is a sequential run of 8 tokens. Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255>
where the <255> tells the program that each of the following byte pairs (up to the next <255>) is sequential and incremented by one. Works great for text files and anywhere there are at least 4 sequential tokens.
Saved 175 bytes on a recent test.
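A minimal sketch of that dictionary rewrite (my own reading of the format above, not the poster's JustBasic code): entries are pair+token triples; when 4 or more consecutive entries have tokens that increase by one, write the first triple, then <255>, then only the pairs, then a closing <255>.

```c
#include <stddef.h>

#define RUN_MARK 255  /* the <255> run marker from the post above */

/* Sketch: 'dict' holds n entries of 3 bytes each (pair[0], pair[1], token).
   Runs of 4+ entries with consecutive tokens are written as
   "pair token <255> pair pair ... <255>".  'out' is assumed large enough.
   Assumes no pair byte or token equals 255.  Returns bytes written. */
static size_t pack_dict(const unsigned char *dict, size_t n, unsigned char *out)
{
    size_t o = 0, i = 0;
    while (i < n) {
        /* measure how many consecutive entries have tokens rising by one */
        size_t run = 1;
        while (i + run < n &&
               dict[3 * (i + run) + 2] == (unsigned char)(dict[3 * i + 2] + run))
            run++;

        /* first entry of the group is always written in full */
        out[o++] = dict[3 * i + 0];
        out[o++] = dict[3 * i + 1];
        out[o++] = dict[3 * i + 2];

        if (run >= 4) {                            /* worth marking as a run */
            out[o++] = RUN_MARK;
            for (size_t k = 1; k < run; k++) {     /* pairs only, tokens implied */
                out[o++] = dict[3 * (i + k) + 0];
                out[o++] = dict[3 * (i + k) + 1];
            }
            out[o++] = RUN_MARK;
            i += run;
        } else {
            i += 1;
        }
    }
    return o;
}
```

On the 8-entry example above, this packs the 24-byte dictionary down to 19 bytes.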
Here is a new BPE (http://encode.ru/threads/1874-Alba).
Example compile command:
gcc -O1 alba.c -o alba.exe
It's faster than default.
There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.
The easiest efficient structure is a 2-dimensional array like byte_pair(255,255). Drop the counts in there and modify them as the file compresses.
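A sketch of what "modify as the file compresses" could look like (my illustration, assuming a 256x256 count table and a replacement token byte `tok`): when the pair at position i is replaced by a single token, the pairs that used to straddle that position lose a count and the new pairs formed with the token gain one.

```c
#include <stdint.h>

/* Sketch: when buf[i],buf[i+1] is about to be replaced by the single byte
   'tok', update the 256x256 pair-count table in place instead of
   recounting the block.  counts[a][b] holds the count of the pair (a, b);
   'len' is the current block length. */
static void update_counts_for_replacement(const unsigned char *buf, long len,
                                          long i, unsigned char tok,
                                          uint32_t counts[256][256])
{
    /* the replaced pair itself disappears */
    counts[buf[i]][buf[i + 1]]--;

    /* the pair formed with the byte before the replacement changes */
    if (i > 0) {
        counts[buf[i - 1]][buf[i]]--;
        counts[buf[i - 1]][tok]++;
    }
    /* the pair formed with the byte after the replacement changes */
    if (i + 2 < len) {
        counts[buf[i + 1]][buf[i + 2]]--;
        counts[tok][buf[i + 2]]++;
    }
}
```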