霍夫曼编码 - 标头和编码EOF

发布于 2024-12-16 21:27:23 字数 274 浏览 2 评论 0原文

我目前正在致力于用Java实现一个基于霍夫曼算法的程序,我正处于需要将编码内容输出到文件的阶段。我对如何实现解码所需的 header 和 eof 有点困惑。对于目前我的标题,我拥有输入文件中出现的所有唯一值及其频率,但在一些文章中,我看到人们用 0 或 1 表示节点,然后是频率(我对此有点困惑)因为它没有说明符号是什么)。

另外,对于我所理解的 EOF,我像符号一样对其进行编码,以便读取和解码它,但是我不确定我可以使用它的什么值,但肯定不会出现?我知道它的权重需要为 1,但不确定如何确保它实际上不会出现在文件中。

I am currently working on implementing a program based on the huffman algorithm in Java, and I am at the stage where I need to output the encoded content to a file. I am a bit confused about how to implement the header and eof needed for decoding. For my header at the moment I have all the unique values that occur from the input file and their frequency, but on some articles I have seen people do it with 0 or 1 represents the nodes and then the frequency (which I am a bit puzzled by as it doesn't say what the symbol is).

Also, for the EOF as I understand it I encode it like the symbols so it gets read and decoded, however I am not sure what value I can use for it that will definitely not come up? I know it needs to a weight of 1 but was unsure how to make sure it won't actually be in the file.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

甜点 2024-12-23 21:27:23

我曾经在一次作业中必须这样做,这就是我们使用的方法。

标头编码是通过使用 0 和 1 来编码树的结构(而不​​是频率)来完成的。 “0”表示沿着树移动,“1”表示我们位于叶节点。这导致了一种对树进行独特编码的前序遍历。

例如,像 (((ab) c) (de)) 这样的树将被编码为“0001a1b1c01 d1e”,其中 a、b、c、d、e 是它们的 ASCII 形式。

这是图形形式的树:

     / \
   /\   /\
 /\  c d  e
a  b 

对于 EOF,我们使用文件中的最后 3 位来指定需要读取最后两个字节中的多少个。一旦我们读取了最后一个字节(所以我们正在处理倒数第二个字节),我们就检查了最后 3 位:它们编码了要读取的位数,减去 6。所以 110101xxxxxxx000 意味着“读取 110101(6 位)并丢弃其他所有内容”。 1101011xxxxxx001 表示“读取 1101011(7 位)并丢弃其余部分”等。

这样做意味着我们不必有一个表示 EOF 的特殊值,我们仍然可以一次读取文件一个字节(尽管我们实际上需要在工作之前读取一个字节)。

(对于 EOF 我还没有读过你的文章,所以我不知道我们的想法是否适合你的方法。)

I've had to do this once for an assignment and this is the approach we used.

Encoding the header was done by using 0 and 1 to encode the structure of the tree (rather than the frequencies). A "0" denoted moving along the tree, a "1" denoted we were at a leaf node. This resulted in a sort of pre-order traversal of the tree encoding it uniquely.

For example, a tree like (((a b) c) (d e)) would be encoded as "0001a1b1c01d1e", where a,b,c,d,e are their ASCII forms.

Here's the tree in a graphical form:

     / \
   /\   /\
 /\  c d  e
a  b 

For the EOF we used the last 3 bits in the file to specify how many of the last two bytes needed to be read. Once we read the last byte (so we were working on the second last byte) we checked the last 3 bits: They encoded how many more bits to read, minus 6. So 110101xxxxxxx000 meant "read 110101 (6 bits) and discard everything else". 1101011xxxxxx001 meant "read 1101011 (7 bits) and discard the rest", etc.

Doing it this way meant we didn't have to have a special value denoting the EOF and we could still read the file a byte at a time (although we actually needed to read one byte ahead of where we were working).

(For the EOF I haven't read your articles, so I don't know if our idea works with your approach.)

无名指的心愿 2024-12-23 21:27:23

霍夫曼编码指定如何从某些字符序列创建霍夫曼树,然后如何将其编码为位序列。

它没有指定如何对树进行编码或如何计算出要读取的确切位数。确切的位数是一个问题,因为您只能将整个字节保存到文件中。因此,您需要某种方法来准确确定在哪一位结束。

对于对树进行编码,有多种选择。其中之一是记录每个字符的计数,并让解码器据此重建树。其他选项是以某种方式直接对树进行编码,例如使用 mange(我假设您提到的文章)描述的 0-1 方法。

然后是自适应霍夫曼编码,它根本不需要树,但更复杂。

为了决定何时结束,您可以将字符总数写入文件并使用它来决定何时停止。或者,如果您使用字符计数对树进行编码,则可以免费获得此计数。

另一种选择是使用 EOF 字符。这是霍夫曼树中的一个字符,但不编码任何正常值。假设您正在编码字节,您可以将其想象为第 257 个值。 (您可以为 EOF 标记使用一些正常值,例如零字节,但这需要您绝对确定它不会出现在输入文件中。)

Huffman encoding specifies how to create the Huffman tree from some sequence of characters and then how to encode that into a sequence of bits.

It doesn't specify how should you encode the tree or how to figure out exactly how many bits to read. The exact count of bits is a problem, because you can save only whole bytes into a file. And so you need some way to figure out exactly at which bit to end.

For encoding the tree, there are several options. One of them is recording the count for each character and letting the decoder reconstruct the tree from that. Other option is to directly encode the tree somehow, for example using the 0-1 approach mange (and I assume the articles you mentioned) describes.

Then there is adaptive Huffman coding which doesn't require the tree at all, but is more complicated.

For deciding when to end, you could write the total count of characters into the file and use that to decide when to stop. Or you could get this count for free if you encoded the tree using character counts.

Another option is to use an EOF character. This is a character that is in your Huffman tree, but doesn't encode any normal value. You could imagine it as a 257th value, assuming you are encoding bytes. (You could use some normal value, like the zero byte, for the EOF token, but that would require you to be absolutely sure it won't be present in the input file.)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文