Compression library using Nvidia CUDA
Does anyone know a project which implements standard compression methods (like Zip, GZip, BZip2, LZMA,...) using NVIDIA's CUDA library?
I was wondering whether algorithms that can make use of a lot of parallel tasks (like compression) wouldn't run much faster on a graphics card than on a dual- or quad-core CPU.
What do you think about the pros and cons of such an approach?
6 Answers
We have finished the first phase of our research into increasing the performance of lossless data compression algorithms.
Bzip2 was chosen for the prototype. Our team optimized only one operation, the Burrows–Wheeler transform, and got some results: a 2x-4x speedup on well-compressible files. The code runs faster on all of our tests.
We are going to complete bzip2 and add support for deflate and LZMA for some real-life tasks such as HTTP traffic and backup compression.
Blog link:
http://www.wave-access.com/public_en/blog/2011/april/22/breakthrough-in-cuda-data-compression.aspx
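To show what the Burrows–Wheeler transform step actually does, here is a minimal, naive sketch (not their implementation) that sorts all rotations of a small block on the GPU with Thrust. Real GPU BWT code builds suffix arrays instead of comparing whole rotations, so treat this purely as an illustration of the operation being parallelized.

```cuda
// Naive Burrows-Wheeler transform of a tiny block, sorted on the GPU via Thrust.
// Illustrative only: an O(n) comparator per comparison is far too slow for real use.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <vector>
#include <cstdio>

struct RotationLess {
    const unsigned char* data;
    int n;
    __host__ __device__
    bool operator()(int a, int b) const {
        // Compare the rotations starting at positions a and b, character by character.
        for (int i = 0; i < n; ++i) {
            unsigned char ca = data[(a + i) % n];
            unsigned char cb = data[(b + i) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    }
};

int main() {
    const char* text = "banana";
    const int n = 6;

    thrust::device_vector<unsigned char> d_text(text, text + n);
    thrust::device_vector<int> idx(n);
    thrust::sequence(idx.begin(), idx.end());            // 0..n-1: rotation start positions

    RotationLess cmp{thrust::raw_pointer_cast(d_text.data()), n};
    thrust::sort(idx.begin(), idx.end(), cmp);           // sort all rotations on the GPU

    // The BWT output is the last column of the sorted rotation matrix.
    std::vector<int> h_idx(n);
    thrust::copy(idx.begin(), idx.end(), h_idx.begin());
    for (int i = 0; i < n; ++i)
        printf("%c", text[(h_idx[i] + n - 1) % n]);      // prints "nnbaaa" for "banana"
    printf("\n");
    return 0;
}
```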
I'm not aware of anyone having done that and made it public. Just IMHO, it doesn't sound very promising.
As Martinus points out, some compression algorithms are highly serial. Block compression algorithms like LZW can be parallelized by coding each block independently. Zipping a large tree of files can be parallelized at the file level.
However, none of these is really SIMD-style parallelism (Single Instruction Multiple Data), and they're not massively parallel.
GPUs are basically vector processors, where you can be doing hundreds or thousands of ADD instructions all in lock step, and executing programs with very few data-dependent branches.
Compression algorithms in general sound more like an SPMD (Single Program Multiple Data) or MIMD (Multiple Instruction Multiple Data) programming model, which is better suited to multicore CPUs.
Video compression algorithms can be accelerated by GPGPU processing such as CUDA only to the extent that there is a very large number of pixel blocks being cosine-transformed or convolved (for motion detection) in parallel, and the IDCT or convolution subroutines can be expressed with branchless code.
GPUs also like algorithms with high numeric intensity (the ratio of math operations to memory accesses). Algorithms with low numeric intensity (like adding two vectors) can be massively parallel and SIMD, but still run slower on the GPU than on the CPU because they're memory-bound.
Typically compression algorithms cannot make use of parallel tasks; it is not easy to make these algorithms highly parallelizable. In your examples, TAR is not a compression algorithm, and the only one that might be highly parallelizable is BZIP, because it is a block compression algorithm. Each block can be compressed separately, but this requires lots and lots of memory. LZMA does not work in parallel either; when you see 7zip using multiple threads, this is because 7zip splits the data stream into two different streams that are each compressed with LZMA in a separate thread, so the compression algorithm itself is not parallel. This splitting only works when the data permits it.
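As a rough illustration (not how 7zip or pbzip2 actually work), here is a host-side sketch of the block-splitting idea: the input is cut into fixed-size blocks and each block is compressed independently with zlib in its own thread. Every in-flight block needs its own output buffer, which is where the extra memory goes; the 1 MiB block size and one-thread-per-block layout are arbitrary choices for the example.

```cuda
// Host-only C++ sketch of block-parallel compression (no GPU involved).
// Build with: nvcc -std=c++11 example.cu -lz  (or any C++ compiler plus zlib)
#include <zlib.h>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const size_t block_size = 1 << 20;                      // 1 MiB blocks (arbitrary)
    std::vector<unsigned char> input(8 * block_size, 'A');  // dummy, highly compressible data
    const size_t num_blocks = input.size() / block_size;

    std::vector<std::vector<unsigned char>> out(num_blocks);
    std::vector<uLongf> out_len(num_blocks);
    std::vector<std::thread> workers;

    for (size_t b = 0; b < num_blocks; ++b) {
        workers.emplace_back([&, b] {
            out_len[b] = compressBound(block_size);
            out[b].resize(out_len[b]);                      // per-block buffer: the memory cost
            compress2(out[b].data(), &out_len[b],
                      input.data() + b * block_size, block_size,
                      Z_BEST_COMPRESSION);                  // each block compressed independently
        });
    }
    for (auto& t : workers) t.join();

    for (size_t b = 0; b < num_blocks; ++b)
        printf("block %zu: %lu -> %lu bytes\n", b,
               (unsigned long)block_size, (unsigned long)out_len[b]);
    return 0;
}
```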
Encryption algorithms have been quite successful in this area, so you might want to look into that. Here is a paper related to CUDA and AES encryption: http://www.manavski.com/downloads/PID505889.pdf
We're making an attempt at porting bzip2 to CUDA. :) So far (and with only rough tests done), our Burrows-Wheeler Transform is 30% faster than the serial algorithm. http://bzip2.github.com
30% is nice, but for applications like backups it's not enough by a long shot.
My experience is that the average data stream in such instances gets 1.2-1.7:1 compression using gzip and ends up limited to an output rate of 30-60Mb/s (this is across a wide range of modern (circa 2010-2012) medium- to high-end CPUs).
The limitation here is usually the speed at which data can be fed into the CPU itself.
Unfortunately, in order to keep an LTO5 tape drive happy, it needs a raw (incompressible) data rate of around 160Mb/s. If fed compressible data it requires even faster data rates.
LTO compression is clearly a lot faster, but somewhat inefficient (equivalent to gzip -1 - it's good enough for most purposes). LTO4 drives and upwards usually have built-in AES-256 encryption engines which can also maintain these kinds of speeds.
What this means for my case is that I'd need a 400% or better improvement to consider it worthwhile - roughly what it takes to lift the 30-60Mb/s that gzip achieves up to the 160Mb/s the drive wants.
Similar considerations apply across LANs. At 30Mb/s, compression is a hindrance on Gb-class networks and the question is whether to spend more on networking or on compression... :)