Fast 24-bit array -> 32-bit array conversion?
Quick Summary:
I have an array of 24-bit values. Any suggestions on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So, I have to expand the 24-bit values to 32 bits before I can load them into the GPU.
I really don't care what I set the remaining 8 bits to, or where the incoming 24 bits are in that 32-bit value - I can fix all that in a pixel shader. But I need to do the conversion from 24-bit to 32-bit really quickly.
I'm not terribly familiar with SIMD SSE operations, but from my cursory glance it doesn't look like I can do the expansion using them, given my reads and writes aren't the same size. Any suggestions? Or am I stuck sequentially massaging this data set?
This feels so very silly - I'm using the pixel shaders for parallelism, but I have to do a sequential per-pixel operation before that. I must be missing something obvious...
Comments (4)
The code below should be pretty fast. It copies 4 pixels in each iteration, using only 32-bit read/write instructions. The source and destination pointers should be aligned to 32 bits.
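The original listing did not survive on this page, so what follows is only a minimal sketch of the approach described, not the answerer's code. The function name `expand24to32` and its parameters are made up for illustration, and the sketch assumes a little-endian CPU (such as x86) and a pixel count that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: expand packed 24-bit pixels to 32-bit pixels, 4 pixels per
 * iteration, using only 32-bit loads and stores.
 * Assumptions: little-endian CPU, pixel_count is a multiple of 4,
 * both pointers are 32-bit aligned. */
void expand24to32(const uint32_t *src, uint32_t *dst, size_t pixel_count)
{
    assert(pixel_count % 4 == 0);
    for (size_t i = 0; i < pixel_count; i += 4) {
        uint32_t sa = src[0]; /* bytes 0..3:  pixel 0, first byte of pixel 1     */
        uint32_t sb = src[1]; /* bytes 4..7:  rest of pixel 1, half of pixel 2   */
        uint32_t sc = src[2]; /* bytes 8..11: last byte of pixel 2, pixel 3      */

        /* The low 24 bits of each output word hold one pixel. The top byte
         * of the first three words is left holding stray source bits. */
        dst[0] = sa;
        dst[1] = (sa >> 24) | (sb << 8);
        dst[2] = (sb >> 16) | (sc << 16);
        dst[3] = sc >> 8;

        src += 3;
        dst += 4;
    }
}
```

Three 32-bit reads produce four 32-bit writes per iteration. The top byte is not cleared here, which matches the question's statement that the extra 8 bits can hold anything.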
Edit:
Here is a way to do this using the SSSE3 instructions PSHUFB and PALIGNR. The code is written using compiler intrinsics, but it shouldn't be hard to translate to assembly if needed. It copies 16 pixels in each iteration. The source and destination pointers must be aligned to 16 bytes, or it will fault. If they aren't aligned, you can make it work by replacing _mm_load_si128 with _mm_loadu_si128 and _mm_store_si128 with _mm_storeu_si128, but this will be slower.
SSSE3 (not to be confused with SSE3) will require a relatively new processor: Core 2 or newer, and I believe AMD doesn't support it yet. Performing this with SSE2 instructions only will take a lot more operations, and may not be worth it.
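The intrinsics listing is also missing from this page. The sketch below shows one way the PSHUFB/PALIGNR idea could look. It is an illustration under assumptions, not the answerer's code: the function name `rgb24_to_rgb32_ssse3` and the `SSSE3_FN` macro are hypothetical, the pixel count is assumed to be a multiple of 16, and both buffers are assumed to be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 (PSHUFB), _mm_alignr_epi8 (PALIGNR) */

/* Hypothetical helper macro: lets GCC/Clang compile this one function
 * for SSSE3 even when the rest of the file is not built with -mssse3. */
#if defined(__GNUC__) || defined(__clang__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Sketch: 16 pixels per iteration (48 bytes in, 64 bytes out).
 * Assumptions: pixel_count % 16 == 0, src and dst are 16-byte aligned. */
SSSE3_FN
void rgb24_to_rgb32_ssse3(const __m128i *src, __m128i *dst, size_t pixel_count)
{
    /* Shuffle control: take 3 bytes, then emit a zero byte (index -1). */
    const __m128i mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1,
                                       6, 7, 8, -1, 9, 10, 11, -1);
    assert(pixel_count % 16 == 0);
    for (size_t i = 0; i < pixel_count; i += 16) {
        __m128i sa = _mm_load_si128(src);     /* source bytes  0..15 */
        __m128i sb = _mm_load_si128(src + 1); /* source bytes 16..31 */
        __m128i sc = _mm_load_si128(src + 2); /* source bytes 32..47 */

        /* PALIGNR moves the next 12 source bytes to the bottom of a
         * register; PSHUFB then spreads them into four 32-bit pixels. */
        _mm_store_si128(dst,     _mm_shuffle_epi8(sa, mask));
        _mm_store_si128(dst + 1, _mm_shuffle_epi8(_mm_alignr_epi8(sb, sa, 12), mask));
        _mm_store_si128(dst + 2, _mm_shuffle_epi8(_mm_alignr_epi8(sc, sb, 8), mask));
        _mm_store_si128(dst + 3, _mm_shuffle_epi8(_mm_alignr_epi8(sc, sc, 4), mask));

        src += 3;
        dst += 4;
    }
}
```

In this sketch the shuffle mask writes a zero into the fourth byte of every pixel, so the extra byte comes out cleared at no additional cost.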
SSE3 is awesome, but for those who can't use it for whatever reason, here's the conversion in x86 assembler, hand-optimized by yours truly. For completeness, I give the conversion in both directions: RGB32->RGB24 and RGB24->RGB32.
Note that interjay's C code leaves trash in the MSB (the alpha channel) of the destination pixels. This might not matter in some applications, but it matters in mine, hence my RGB24->RGB32 code forces the MSB to zero. Similarly, my RGB32->RGB24 code ignores the MSB; this avoids garbage output if the source data has a non-zero alpha channel. These features cost almost nothing in terms of performance, as verified by benchmarks.
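The assembly itself is not reproduced on this page. As a rough C illustration of the two alpha-handling rules just described (and not a translation of the answerer's assembly), the sketch below masks the top byte in both directions. The function names and `RGB_MASK` are hypothetical, and a little-endian CPU and a pixel count that is a multiple of 4 are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RGB_MASK 0x00FFFFFFu /* low 24 bits: the colour bytes of one pixel */

/* RGB24 -> RGB32: every output pixel gets a zero top (alpha) byte. */
void rgb24_to_rgb32_zero_alpha(const uint32_t *src, uint32_t *dst, size_t pixel_count)
{
    assert(pixel_count % 4 == 0);
    for (size_t i = 0; i < pixel_count; i += 4, src += 3, dst += 4) {
        uint32_t sa = src[0], sb = src[1], sc = src[2];
        dst[0] = sa & RGB_MASK;
        dst[1] = ((sa >> 24) | (sb << 8)) & RGB_MASK;
        dst[2] = ((sb >> 16) | (sc << 16)) & RGB_MASK;
        dst[3] = sc >> 8; /* the shift already clears the top byte */
    }
}

/* RGB32 -> RGB24: the source alpha byte is dropped, whatever it holds. */
void rgb32_to_rgb24_ignore_alpha(const uint32_t *src, uint32_t *dst, size_t pixel_count)
{
    assert(pixel_count % 4 == 0);
    for (size_t i = 0; i < pixel_count; i += 4, src += 4, dst += 3) {
        uint32_t p0 = src[0], p1 = src[1], p2 = src[2], p3 = src[3];
        dst[0] = (p0 & RGB_MASK) | (p1 << 24);        /* pixel 0 + byte 0 of pixel 1    */
        dst[1] = ((p1 >> 8) & 0xFFFFu) | (p2 << 16);  /* bytes 1-2 of p1 + bytes 0-1 of p2 */
        dst[2] = ((p2 >> 16) & 0xFFu) | (p3 << 8);    /* byte 2 of p2 + pixel 3         */
    }
}
```

The extra work is one AND per output word, which is consistent with the claim that these guarantees cost almost nothing.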
For RGB32->RGB24 I was able to beat the VC++ optimizer by about 20%. For RGB24->RGB32 the gain was insignificant. Benchmarking was done on an i5 2500K. I omit the benchmarking code here, but if anyone wants it I'll provide it. The most important optimization was bumping the source pointer as soon as possible (see the ASAP comment). My best guess is that this increases parallelism by allowing the instruction pipeline to prefetch sooner. Other than that I just reordered some instructions to reduce dependencies and overlap memory accesses with bit-bashing.
And while we're at it, here are the same conversions in actual SSE3 assembly. This only works if you have an assembler (FASM is free) and a CPU that supports SSE3 (likely, but it's better to check). Note that the intrinsics don't necessarily output something this efficient; it depends entirely on the tools you use and the platform you're compiling for. Here, it's straightforward: what you see is what you get. This code generates the same output as the x86 code above, and it's about 1.5x faster (on an i5 2500K).
The different input/output sizes are not a barrier to using SIMD, just a speed bump. You would need to chunk the data so that you read and write in full SIMD words (16 bytes).
In this case, you would read 3 SIMD words (48 bytes == 16 RGB pixels), do the expansion, then write 4 SIMD words.
I'm just saying you can use SIMD, I'm not saying you should. The middle bit, the expansion, is still tricky since you have non-uniform shift sizes in different parts of the word.
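To make the chunking concrete, here is a minimal sketch of the outer loop: whole 16-pixel blocks (48 bytes in, 64 bytes out) go to a block kernel, and any leftover pixels are handled one at a time. All names are hypothetical, and the block kernel is written in plain scalar C as a stand-in for whatever SIMD expansion is actually used.

```c
#include <stddef.h>
#include <stdint.h>

enum { PIXELS_PER_BLOCK = 16, SRC_BLOCK_BYTES = 48, DST_BLOCK_BYTES = 64 };

/* Stand-in for the per-block kernel. A SIMD version would do 3 loads,
 * the expansion, and 4 stores; this scalar version only shows the layout. */
static void expand_block(const uint8_t *src, uint8_t *dst)
{
    for (int p = 0; p < PIXELS_PER_BLOCK; p++) {
        dst[4 * p + 0] = src[3 * p + 0];
        dst[4 * p + 1] = src[3 * p + 1];
        dst[4 * p + 2] = src[3 * p + 2];
        dst[4 * p + 3] = 0; /* filler byte */
    }
}

/* Expand a whole frame: full blocks first, then the remaining pixels. */
void expand_frame(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    size_t blocks = pixel_count / PIXELS_PER_BLOCK;
    size_t tail = pixel_count % PIXELS_PER_BLOCK;

    for (size_t b = 0; b < blocks; b++) {
        expand_block(src, dst);
        src += SRC_BLOCK_BYTES;
        dst += DST_BLOCK_BYTES;
    }
    for (size_t p = 0; p < tail; p++) { /* fewer than 16 pixels left */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = 0;
        src += 3;
        dst += 4;
    }
}
```

Keeping the tail loop separate means the block kernel never reads or writes past the end of a frame whose pixel count is not a multiple of 16.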
SSE 4.1 .ASM: