Optimizing Byte-Pair Encoding
Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
Comments (9)
This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
The links to the code on Dr. Dobb's site are broken, but that webpage mirrors them.
That code uses a hash table to track the digraphs used and their counts on each pass over the buffer, so as to avoid recomputing them from scratch each pass.
My test data is enwik8 from the Hutter Prize.
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold of 4 (the smallest count at which compressing a pair gains a byte); this list is sorted and the first substitution is made.
As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than always re-sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved data between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which alone is worth about 10 seconds of runtime.
Given that the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version, I think the hashtable is of arguable value, and that a list is a better structure for this problem.
It's still multi-pass though, and so it's not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hash tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses. There are of course occasions where it is suitable and used; Symbian uses it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.
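A minimal sketch of the buffer-swap idea mentioned above (the names `one_pass` and `compress_block` are illustrative, not the actual bpev3 code): instead of copying the partially compressed block back from a temporary buffer after each pass, just exchange the two pointers.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for one substitution pass of BPE: reads 'len' bytes from 'src',
   writes the transformed block to 'dst', and returns the new length.
   Here it just copies, so the loop below compiles and can be traced. */
static size_t one_pass(const unsigned char *src, size_t len, unsigned char *dst)
{
    memcpy(dst, src, len);
    return len;
}

/* The pointer-swap trick: keep two buffers and swap roles each pass
   instead of copying the result back into 'buf' every time. */
static size_t compress_block(unsigned char *buf, unsigned char *tmp,
                             size_t len, int passes)
{
    unsigned char *cur = buf, *next = tmp;
    while (passes-- > 0) {
        len = one_pass(cur, len, next);
        unsigned char *t = cur;   /* an O(1) swap replaces an O(n) memcpy */
        cur = next;
        next = t;
    }
    if (cur != buf)               /* caller expects the result in 'buf' */
        memcpy(buf, cur, len);
    return len;
}
```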
I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
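One way to read the table suggestion above (my interpretation; the answer's own code is not shown): keep a flat array of 65,536 pair counts indexed by the two bytes of each digraph. With 16-bit counters, which suffice because a block is at most 64 kB, the table occupies 128 kB.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: count every adjacent byte pair in a block with a flat
   65,536-entry table.  uint16_t counters are enough since no pair can
   occur more than 65,535 times in a <=64 kB block. */
static void count_pairs(const unsigned char *block, size_t len,
                        uint16_t counts[65536])
{
    memset(counts, 0, 65536 * sizeof counts[0]);
    for (size_t i = 0; i + 1 < len; i++)
        counts[(block[i] << 8) | block[i + 1]]++;
}
```

Finding the most frequent pair is then a single scan over the 65,536 counters, and the same table can be updated incrementally as substitutions are made.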
Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Enhanced Byte Pair Encoding
EBPE adds two post-processes to Byte Pair Encoding:
1. Compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
So "AA1" tells us when decoding that every time we see a "1" in the data file, we replace it with "AA".
While long runs of sequential tokens are possible, let's look at this 8-token example:
It is 24 bytes long (8 * 3).
The token 2 is not in the file, indicating that it was not an open token to use; or, to put it another way, the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential, so any time we see a sequential run of 4 tokens or more, let's modify our dictionary to be:
Where the <255> tells us that the tokens for the byte pairs are implied and are incremented by 1 more than the last token we saw (3). We increment by one until we see the next <255>, indicating an end of run.
I saved 175 bytes using this enhancement on a text file where tokens 128 to 254 would be in sequence, as well as others in general, including the run created by lowercase pre-processing.
2. Compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>), we scan the file and find a single "j" in the file. Let this char do double duty:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression.
If the j occurred 1 time in the data file, we would need to add 1 <255> and a 3-byte dictionary entry, so we need to save more than 4 bytes in BPE for this to be worth it.
If the j occurred 6 times, we would need 6 <255>s and a 3-byte dictionary entry, so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on whether further compression is possible and how many byte pairs remain in the file, this post-process has saved in excess of 100 bytes on test runs.
Note: When decompressing, make sure not to decompress every "j". One needs to look at the prior character to make sure it is not a <255> before decompressing. Finally, after all decompression is done, go ahead and remove the <255>'s to recreate your original file.
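A minimal sketch of that decoding check, under one reading of the scheme above (names and structure are mine, not the EBPE code): a "j" is expanded as a token only when it is not preceded by the <255> escape; escaped occurrences are emitted literally and the escape byte dropped.

```c
#include <stdio.h>

#define ESC  255   /* the <255> escape byte */
#define TOK  'j'   /* the rarely used character reused as a token (example) */

/* Sketch of the "don't decompress every j" rule:
   - ESC followed by any byte -> emit that byte literally, drop the ESC
   - TOK on its own           -> it is acting as a byte-pair token
   This only illustrates the escape handling, not full EBPE decoding. */
static void decode_escapes(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == ESC) {
            int next = fgetc(in);
            if (next == EOF) break;
            fputc(next, out);   /* escaped: a literal character */
        } else if (c == TOK) {
            /* unescaped: 'j' is a token; its byte pair would be looked up
               in the dictionary and emitted here (omitted in this sketch) */
        } else {
            fputc(c, out);
        }
    }
}
```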
3. What’s next in EBPE?
Unknown at this time
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, whether the new byte pair (after replacement) will also be a good candidate for replacement.
Here are my thoughts at first sight. Maybe you already do, or have already thought of, all this.
I would try the following.
Two adjustable parameters:
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how they influence compression ratio and speed. They should probably change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!
Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
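The break-even argument in that last parenthesis can be written down directly (a sketch under my reading of it, not SCZ's actual code): replacing each occurrence of the pair BP with the single byte M saves one byte per occurrence, while escaping each original M costs one extra byte, plus whatever fixed cost the mapping itself takes.

```c
/* Sketch of the space accounting described above (my reading, not SCZ code).
   Reusing byte M as a token for pair BP:
     - each of the count_bp occurrences of BP shrinks from 2 bytes to 1,
     - each of the count_m literal occurrences of M grows to 2 bytes (E + M),
     - 'overhead' is the assumed fixed cost of recording the mapping. */
static int reuse_is_worth_it(long count_bp, long count_m, long overhead)
{
    long saved = count_bp;           /* one byte saved per replaced pair    */
    long lost  = count_m + overhead; /* one byte lost per escaped literal M */
    return saved > lost;
}
```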
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8
is a sequential run of 8 tokens. Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255>
where the <255> tells the program that each of the following byte pairs (up to the next <255>) is sequential and incremented by one. Works great for text files and anywhere there are at least 4 sequential tokens.
Saved 175 bytes on a recent test.
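A minimal sketch of that dictionary rewrite (my own reading of the format above, not the poster's JustBasic code): entries are pair+token triples; when 4 or more consecutive entries have tokens that increase by one, write the first triple, then <255>, then only the pairs, then a closing <255>.

```c
#include <stddef.h>

#define RUN_MARK 255  /* the <255> run marker from the post above */

/* Sketch: 'dict' holds n entries of 3 bytes each (pair[0], pair[1], token).
   Runs of 4+ entries with consecutive tokens are written as
   "pair token <255> pair pair ... <255>".  'out' is assumed large enough.
   Assumes no pair byte or token equals 255.  Returns bytes written. */
static size_t pack_dict(const unsigned char *dict, size_t n, unsigned char *out)
{
    size_t o = 0, i = 0;
    while (i < n) {
        /* measure how many consecutive entries have tokens rising by one */
        size_t run = 1;
        while (i + run < n &&
               dict[3 * (i + run) + 2] == (unsigned char)(dict[3 * i + 2] + run))
            run++;

        /* first entry of the group is always written in full */
        out[o++] = dict[3 * i + 0];
        out[o++] = dict[3 * i + 1];
        out[o++] = dict[3 * i + 2];

        if (run >= 4) {                            /* worth marking as a run */
            out[o++] = RUN_MARK;
            for (size_t k = 1; k < run; k++) {     /* pairs only, tokens implied */
                out[o++] = dict[3 * (i + k) + 0];
                out[o++] = dict[3 * (i + k) + 1];
            }
            out[o++] = RUN_MARK;
            i += run;
        } else {
            i += 1;
        }
    }
    return o;
}
```

On the 8-entry example above, this packs the 24-byte dictionary down to 19 bytes.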
Here is a new BPE (http://encode.ru/threads/1874-Alba).
Example compile command:
gcc -O1 alba.c -o alba.exe
It's faster than default.
There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.
The easiest efficient structure is a 2-dimensional array like byte_pair(255,255). Drop the counts in there and modify them as the file compresses.
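A sketch of what "modify as the file compresses" could look like (my illustration, assuming a 256x256 count table and a replacement token byte `tok`): when the pair at position i is replaced by a single token, the pairs that used to straddle that position lose a count and the new pairs formed with the token gain one.

```c
#include <stdint.h>

/* Sketch: when buf[i],buf[i+1] is about to be replaced by the single byte
   'tok', update the 256x256 pair-count table in place instead of
   recounting the block.  counts[a][b] holds the count of the pair (a, b);
   'len' is the current block length. */
static void update_counts_for_replacement(const unsigned char *buf, long len,
                                          long i, unsigned char tok,
                                          uint32_t counts[256][256])
{
    /* the replaced pair itself disappears */
    counts[buf[i]][buf[i + 1]]--;

    /* the pair formed with the byte before the replacement changes */
    if (i > 0) {
        counts[buf[i - 1]][buf[i]]--;
        counts[buf[i - 1]][tok]++;
    }
    /* the pair formed with the byte after the replacement changes */
    if (i + 2 < len) {
        counts[buf[i + 1]][buf[i + 2]]--;
        counts[tok][buf[i + 2]]++;
    }
}
```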