霍夫曼编码的字节频率表
我正在编写一个霍夫曼压缩器和解压缩器(用 C++ 编写),需要处理任意二进制文件。我需要一些数据结构建议。现在,我的压缩过程如下:
- 以二进制形式将文件的字节读取到 char* 缓冲区
- 使用 std::map 来计算文件中每个字节模式的频率。 (我认为这是我自找麻烦的地方。)
- 根据频率直方图构建二叉树。每个内部节点都有其子节点的频率总和,每个叶节点都有一个 char* 来表示实际字节。
到目前为止,这就是我所处的位置。
我的问题是,如果我只使用从 char* 到 int 的映射,我到底要测量什么。如果我是对的,这实际上不是我需要的。我认为我真正做的是使用 char* 跟踪实际的 4 字节指针值。
因此,我打算做的是使用映射作为直方图,使用字符作为存储在叶节点的数据。我的逻辑合理吗?我的推理告诉我是的,但由于这是我第一次处理二进制数据,我想小心那些只会以奇怪的方式出现的陷阱。
谢谢。
I'm writing a huffman compressor and decompressor (in C++) that needs to work on arbitrary binary files. I need a bit of data structure advice. Right now, my compression process is as follows:
- Read the bytes of the file in binary form to a char* buffer
- Use an std::map to count the frequencies of each byte pattern in the file.
(This is where I think I'm asking for trouble.) - Build the binary tree based on the frequency histogram. Each internal node has the sum of the frequencies of its children and each leaf node has a char* to represent the actual byte.
This is where I'm at so far.
My question is what exactly I'm measuring if I just use a map from char* to int. If I'm correct, this isn't actually what I need. What I think I'm really doing is tracking the actual 4-byte pointer values by using char*.
So, what I plan to do is use a map for the histogram and a char for the data stored at leaf nodes. Is my logic sound here? My reasoning tells me yes, but since this is my first time dealing with binary data, I'd like to be careful of pitfalls that will only show up in strange ways.
Thanks.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
你不需要地图;只有 256 个可能的值。只需使用
int freq[256] = {0}
并为输入中的每个字节添加freq[data[idx]]++
即可。如果您确实想要地图,请使用
map
;您对使用char*
中的地图的怀疑是正确的。You don't need a map; there are only 256 possible values. Just have
int freq[256] = {0}
and add to it withfreq[data[idx]]++
for each byte in the input.If you REALLY want a map, use
map<unsigned char, int>
; your suspicion on using map fromchar*
is correct.