霍夫曼《终结者》位串

发布于 2024-10-16 15:32:15 字数 2067 浏览 5 评论 0原文

动机

想象一个霍夫曼压缩文件被部分下载,就像在 P2P 软件中一样,所以我们首先为整个文件分配磁盘空间,然后开始下载随机文件块。其中一个霍夫曼代码(但我们不知道是哪一个)是结束代码,因此如果该代码被解码,我们就停止。假设该文件由多个霍夫曼压缩流组成,我们可以尝试在下载完成之前解压其中一些。

预分配磁盘空间的方式现在很重要:假设我们有一个霍夫曼流的开头,但它不完整,所以我们遇到了预分配的磁盘空间。通常,这个空间都是0,所以我们继续用哈夫曼码00..来解码符号。如果这不是我们的最终代码,我们不会注意到“无效”数据,并且如果有 2 GB 的预分配空间,我们会进行很多无用的解码。

因此,我们希望以一种使解码尽快停止的方式来预分配空间。

我正在寻找充当“霍夫曼终止符”的最短位串

,这意味着如果我们解码该字符串,我们将至少解码每个霍夫曼代码一次,因此我们肯定会收到一个结束代码。这应该适用于长度为 1..n 位的霍夫曼代码的每种组合。

注意:我知道上面的假设场景有简单的解决方案(使用 00.. 作为结束代码,使用 P2P 段数据来检测尚未下载的块),但这只是一个示例场景显示“霍夫曼终止符”位串的理论用途,我对解决这种情况不感兴趣,而是寻找算法/方法/想法来生成/查找充当“霍夫曼终止符”的位串。

示例

让我们看看 n = 2 可能的霍夫曼编码组合:[0, 1][00, 01, 1]、<代码>[0, 10, 11] ,<代码>[00, 01, 10, 11] 。现在让我们从一个包含长度为 1..n (0, 1, 00) 的每个可能位序列的位串开始code>, 01, 10, 11):

001011

使用不同的霍夫曼代码组合解码得到 (霍​​夫曼代码被分配给符号 A..D):

Combination   Decoded symbols
[0, 1]        AABABB
[00,01,1]     ACBC
[0,10,11]     AABC
[00,01,10,11] ACD

这是一个好的开始,它已经解码了前三个的每个霍夫曼代码,但是如果我们用 [00, 01, 10 解码它, 11],我们缺少符号 B(霍夫曼代码 01)。因此,我们将其附加到我们的位串中:

00101101

这是 n=2 的有效“霍夫曼终止符”,长度为 8 位。如果我们用这个字节预先分配磁盘空间,我们一定会终止所有不超过 2 位的霍夫曼代码。我们甚至知道 n=2 不会有更短的终止符字符串,因为它是组合 [00, 01, 10, 11] 解码每个符号的最小长度一次。

我还找到了 n=3 的“Huffman 终结符”,0001011001110100111010011100010101111101110 (43 位),但我不能 100% 确定它是否正确,我也不知道如果是最短的那一个。

我正在寻找

  • 算法/想法来查找或生成给定n的霍夫曼终止符。我的尝试与示例类似:生成起始字符串并根据需要附加位以满足所有不同的霍夫曼代码组合。但我确信有更好的方法。

  • 特定的霍夫曼终止符,n=8n=16,尽管生成它们的计算成本可能很高,而且它们可能会很长。

  • 关于此问题(或类似问题)的论文/链接(如果有)。

如果我们从位位置 1..n 开始,找到“霍夫曼终止符”的奖励

积分也会起作用,因此如果数据之前已解码,它甚至会终止,并且我们不会到达并启动新的霍夫曼第一位的代码。

Motivation

Imagine a huffman compressed file that gets downloaded partially, like in P2P software, so we allocate disk space for the whole file first and then start downloading random file chunks. One of the huffman codes (but we don't know which) is an end code, so if this code is decoded, we stop. Assuming the file consists of several huffman compressed streams, we can try to decompress some of them before the download finished.

The way the disk space is preallocated is important now: Let's assume we have the start of a huffman stream, but it isn't complete, so we run into the preallocated disk space. Usually, this space is all 0, so we'd keep decoding symbols with huffman code 00... If this isn't our end code, we don't notice the "invalid" data and f.e. if there's 2 GB of preallocated space, we're doing much useless decoding.

So we'd like to preallocate the space in a way to make decoding stop as soon as possible.

The question

I'm looking for the shortest bitstring that acts as a "Huffman terminator", meaning if we decode this string, we'll decode every Huffman code at least once, so we'll definitely receive an end code. This should work for every combination of Huffman codes of length 1..n bits.

Note: I know there are simple solutions to the hypothetical scenario above (using 00.. as an end code, using P2P segment data to detect chunks not downloaded yet), but this is just an example scenario to show a theoretical use of a "Huffman terminator" bitstring, I'm not interested in solving this scenario, but looking for algorithms/ways/ideas to generate/find bitstrings that act as "Huffman terminator" .

Example

Let's look at the possible Huffman code combinations for n = 2: [0, 1], [00, 01, 1], [0, 10, 11], [00, 01, 10, 11]. Now let's start with a bitstring that contains every possible bit sequence of length 1..n (0, 1, 00, 01, 10, 11):

001011

Decoding with the different Huffman code combinations gives (huffman codes are assigned to symbols A..D):

Combination   Decoded symbols
[0, 1]        AABABB
[00,01,1]     ACBC
[0,10,11]     AABC
[00,01,10,11] ACD

This is a good start and it already decodes every Huffman code for the first three ones, but if we decode it with [00, 01, 10, 11], we're missing symbol B (Huffman code 01). So let's just append this to our bitstring:

00101101

This is a valid "Huffman terminator" for n=2, 8 bits in length. If we'd preallocate our disk space with this byte, we'd be sure to terminate all huffman codes that won't exceed 2 bits. We even know there won't be shorter terminator strings for n=2 because it's the minimal length for the combination [00, 01, 10, 11] to decode each symbol once.

I also found a "Huffman terminator" for n=3, 0001011001110100111010011100010101111101110 (43 bits), but I'm not 100% sure if it's correct and I don't know if it's the shortest one.

What I'm looking for

  • Algorithms/ideas to find or generate Huffman terminators for a given n. My attempt would be similar to the example: generating a start string and appending bits as needed to satisfy all different Huffman code combinations. But I'm sure there are better ways.

  • Specific Huffman terminators, n=8 and n=16, although it might be computational expensive to generate them and they will perhaps be very long.

  • Papers/links about this problem (or similar ones) if there are any.

Bonus

Bonus points for finding "Huffman terminators" that also work if we start at bit position 1..n, so it even terminates if data was decoded before and we won't arrive and start a new huffman code at the first bit.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

昨迟人 2024-10-23 15:32:15

如果我理解正确,那么最多 n 位霍夫曼代码的任何通用终止符都需要至少 n * 2^n 位,因为可能存在 2^n 个源符号(包括未知终止符号) ),每个符号出现的概率相同,因此每个符号都需要一个 n 位代码。这也告诉您任何此类最小长度通用终止符都将是这些 n 位 2^n 块的排列。

例如,对于 n=16,通用终止符不能短于 1048576 位或 128Kb。 (当然,可能需要更长的时间。)

If I understand correctly, then any universal terminator for an up-to-n-bit Huffman code requires at least n * 2^n bits, since it could be the case that there are 2^n source symbols (including the unknown termination symbol) that each occur with equal probability, thus necessitating an n-bit code for each symbol. This also tells you that any such minimal-length universal terminator will be a permutation of these 2^n blocks of n bits.

So for example for n=16, no universal terminator can be shorter than 1048576 bits, or 128Kb. (And of course it may need to be much longer.)

赤濁 2024-10-23 15:32:15

也许您不应该在这种情况下使用霍夫曼。

或者更好地跟踪哪些片段已下载(未下载)。

Maybe you shouldn't use Huffman in this scenario.

Or keep better track of what segments are (not) downloaded.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文