Is it possible to implement Huffman decoding on the GPU?
We have a database encoded with Huffman coding. The aim is to copy it, together with its associated decoder, onto the GPU; then decode the database on the GPU and do work on the decoded database without copying it back to the CPU.
I am far from being a Huffman specialist, but the little I know suggests that it is an algorithm essentially based on control structures. With the basic algorithm, I am afraid there will be a lot of serialized operations.
My two questions are:
- do you know whether an efficient GPU version of Huffman coding exists?
- if not, do you think there is a Huffman algorithm that could be adapted to the GPU (i.e. with fewer control structures)? Or maybe you know of (and could provide a reference for) a result showing that Huffman decoding cannot be made efficient on a GPU.
I see other constraints, but they are not critical:
- a GPU may not be very efficient at handling trees, but a binary tree can be stored in a classical array (see the sketch after this list)
- the workload could be difficult to balance: we'll see about that later
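On the array point: below is a minimal sketch (my own illustration, not from the post) of a Huffman decoding tree flattened into an array so a GPU thread can traverse it without pointer chasing. `HuffNode`, `decode_symbol`, and the LSB-first bit order are all assumptions made for the example.

```cuda
// Flat-array Huffman tree: node 0 is the root; internal nodes hold child
// indices, leaves hold the decoded symbol.
struct HuffNode {
    int left;    // index of left child, or -1 if this node is a leaf
    int right;   // index of right child, or -1 if this node is a leaf
    int symbol;  // decoded symbol (meaningful only for leaves)
};

// Decode one symbol, starting at bit position *pos in the packed bitstream.
__device__ int decode_symbol(const HuffNode* tree,
                             const unsigned char* bits,
                             int* pos)
{
    int node = 0;                                    // start at the root
    while (tree[node].left != -1) {                  // walk until we hit a leaf
        int bit = (bits[*pos >> 3] >> (*pos & 7)) & 1;  // LSB-first (assumed)
        node = bit ? tree[node].right : tree[node].left;
        ++(*pos);
    }
    return tree[node].symbol;
}
```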
The problem with Huffman coding is that you can't fast-forward: you have to decode bit by bit, linearly.
As such, it's not ideal for parallelism.
If you can decide on the encoding, though, you could encode chunk by chunk so that each chunk can be decoded independently.
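One way to realize this, assuming you control the encoder: encode each chunk independently, byte-align it, and record an offset table so a decoder (e.g. one GPU thread per chunk) can jump straight to its chunk. This is a hypothetical host-side sketch; `encode_chunk` is an assumed helper, not a real library call.

```cuda
#include <vector>
#include <algorithm>
#include <cstddef>

// Assumed helper: Huffman-encodes one chunk and appends it, byte-padded, to `out`.
void encode_chunk(const unsigned char* data, std::size_t n,
                  std::vector<unsigned char>& out);

struct ChunkedStream {
    std::vector<unsigned char> bits;    // concatenated encoded chunks
    std::vector<std::size_t>   offset;  // offset[i] = first byte of chunk i
    std::vector<std::size_t>   nsyms;   // symbols encoded in chunk i
};

ChunkedStream encode_chunked(const unsigned char* data, std::size_t n,
                             std::size_t chunk_size)
{
    ChunkedStream out;
    for (std::size_t start = 0; start < n; start += chunk_size) {
        std::size_t len = std::min(chunk_size, n - start);
        out.offset.push_back(out.bits.size());   // each chunk starts byte-aligned
        out.nsyms.push_back(len);
        encode_chunk(data + start, len, out.bits);
    }
    return out;
}
```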
I am astonished by the apparent consensus that Huffman on a GPU is impossible.
I appeal to the aphorism: "If it happens, it must be possible".
(variously attributed to Agatha Christie, Albert Einstein, etc.)
Since SuperXero does Huffman on a GPU, I suppose it must be possible.
CPU huffman compression faster after first execution? (SuperXero)
Google: GPU huffman decompression
Yes, you can do Huffman decoding in parallel, and so you can get an advantage on a GPU - provided memory is not an issue.
For the discussion below I'll talk about the Huffman tree and the Huffman output - the output is the stream of compressed symbols that must be looked up in the Huffman tree to be decoded.
The Huffman algorithm requires that you have a Huffman tree for decoding - and that tree can be large. You can get around this by using a small Huffman tree that fits in local memory on the GPU, but this will hurt the compression efficiency of the algorithm. E.g. you could limit the tree to the best 2^n nodes, as large as your GPU processors allow (e.g. use a tree limited to, say, 1024 nodes).
If you don't limit the Huffman tree so that one copy fits in local storage on each GPU processor, you won't really get the parallelism you expect, because all the GPU processors will be blocked on memory access, all reading the same shared tree.
In the Huffman output the symbols are packed into a variable number of bits, so if you start in the middle of the output there is no way to know whether you are on a symbol boundary. But you can create your own boundaries: for example, you could force the symbol stream to re-align on a word boundary every x words. Then you know you can start decoding at any multiple of x words in the output and send that block to a GPU processing node, along with the appropriate tree (a kernel sketch follows below).
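A sketch of that word-alignment idea, assuming the encoder forced the stream back onto a 32-bit word boundary every `words_per_segment` words. One thread decodes one segment; since each segment starts on a known word boundary, every thread can start cold. `HuffNode` and `decode_symbol` are the flat-tree helpers sketched earlier; all names here are illustrative.

```cuda
__global__ void decode_segments(const HuffNode* tree,
                                const unsigned char* bits,    // packed bitstream
                                int words_per_segment,        // alignment interval x
                                int num_segments,
                                const int* syms_per_segment,  // symbol count per segment
                                const int* out_offset,        // where each segment writes
                                int* out)
{
    int seg = blockIdx.x * blockDim.x + threadIdx.x;
    if (seg >= num_segments) return;

    // Each segment starts on a known word boundary, so decoding can begin here.
    int pos = seg * words_per_segment * 32;   // bit index of this segment's start
    for (int i = 0; i < syms_per_segment[seg]; ++i)
        out[out_offset[seg] + i] = decode_symbol(tree, bits, &pos);
}
```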
You don't have to use just one tree - but one tree per block may be overkill too: if you have one tree per block, you'll severely cut into the compression efficiency when the blocks are small.
So you can try looking at the similarity of blocks, encode similar blocks with the same tree, and store a tree index per block. E.g. you may have 10,000 blocks in the output but just 50 1024-node trees. Then you send one block and one tree to each GPU processing node to decode in parallel.
The key to making it fast is that each GPU processing node works on local memory only.
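To illustrate that last point, here is a sketch (an assumption of mine, not the answerer's code) in which each thread block first copies its assigned small tree into shared memory, then decodes from that on-chip copy instead of having every processor hammer the same tree in global memory. All names and sizes are assumptions; for simplicity, one thread per block does the sequential bit-serial decode of its data block.

```cuda
#define TREE_NODES 1024   // assumed per-tree node limit, as discussed above

__global__ void decode_with_shared_tree(const HuffNode* trees,  // all trees, back to back
                                        const int* tree_index,  // tree id for each data block
                                        const unsigned char* bits,
                                        const int* bit_start,   // first bit of each data block
                                        const int* nsyms,       // symbols in each data block
                                        const int* out_offset,
                                        int* out)
{
    __shared__ HuffNode tree[TREE_NODES];

    // Cooperative copy of this data block's tree into on-chip shared memory.
    const HuffNode* src = trees + tree_index[blockIdx.x] * TREE_NODES;
    for (int i = threadIdx.x; i < TREE_NODES; i += blockDim.x)
        tree[i] = src[i];
    __syncthreads();

    // Bit-serial decode from the local tree copy (one thread, for simplicity).
    if (threadIdx.x == 0) {
        int pos = bit_start[blockIdx.x];
        for (int i = 0; i < nsyms[blockIdx.x]; ++i)
            out[out_offset[blockIdx.x] + i] = decode_symbol(tree, bits, &pos);
    }
}
```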