Fast RGB => YUV conversion in OpenCL
I know the following formula can be used to convert RGB images to YUV images. In the following formula, R, G, B, Y, U, V are all 8-bit unsigned integers, and intermediate values are 16-bit unsigned integers.
Y = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16
U = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128
V = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128
But when the formula is used in OpenCL, it's a different story.
1. 8-bit memory write access is an optional extension, which means some OpenCL implementations may not support it.
2. Even if the above extension is supported, it's extremely slow compared with 32-bit write access.
To get better performance, 4 pixels are processed at a time, so the input is twelve 8-bit integers and the output is three 32-bit unsigned integers (the first holds 4 Y samples, the second 4 U samples, and the last 4 V samples).
My question is: how do I get these three 32-bit integers directly from the twelve 8-bit integers? Is there a formula that produces them directly, or do I just need to use the old formula to get twelve 8-bit results (4 Y, 4 U, 4 V) and then construct the three 32-bit integers with bitwise operations?
Comments (3)
Even though this question was asked 2 years ago, I think some working code would help here. Given the initial concern about poor performance when accessing 8-bit values directly, it's better to use 32-bit accesses wherever possible.
Some time ago I developed and used the following OpenCL kernel to convert ARGB (the typical Windows bitmap pixel layout) to a full-sized Y plane plus quarter-sized U and V planes, the memory layout used as input for libx264 encoding.
This code performs only global 32-bit memory access while 8-bit processing happens within each work item.
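As a rough illustration of that idea (not the author's kernel), here is a minimal host-side C sketch of what a single work item could do: read four 32-bit ARGB words, apply the question's formulas per byte, and emit one packed 32-bit word each for Y, U and V. The function names, the 0xAARRGGBB word layout, and the choice to put the first pixel in the lowest byte are all assumptions made for the example, and no chroma subsampling is done here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Formulas from the question. For U and V the final "+ 128" is folded into
   the sum as (128 << 8), so the value being shifted is never negative
   (right-shifting a negative int is implementation-defined in C). */
static uint32_t rgb_to_y(uint32_t r, uint32_t g, uint32_t b) {
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16;
}

static uint32_t rgb_to_u(uint32_t r, uint32_t g, uint32_t b) {
    int s = -38 * (int)r - 74 * (int)g + 112 * (int)b + 128 + (128 << 8);
    return (uint32_t)(s >> 8);
}

static uint32_t rgb_to_v(uint32_t r, uint32_t g, uint32_t b) {
    int s = 112 * (int)r - 94 * (int)g - 18 * (int)b + 128 + (128 << 8);
    return (uint32_t)(s >> 8);
}

/* Hypothetical helper: converts 4 pixels and packs the results.
   Assumes each input word is 0xAARRGGBB; pixel i lands in byte i of
   each output word (pixel 0 in the least significant byte). */
static void convert4(const uint32_t argb[4],
                     uint32_t *y4, uint32_t *u4, uint32_t *v4) {
    assert(argb != NULL && y4 != NULL && u4 != NULL && v4 != NULL);
    uint32_t y = 0, u = 0, v = 0;
    for (int i = 0; i < 4; ++i) {
        /* 8-bit processing happens on values already held in registers. */
        uint32_t r = (argb[i] >> 16) & 0xFFu;
        uint32_t g = (argb[i] >> 8) & 0xFFu;
        uint32_t b = argb[i] & 0xFFu;
        y |= rgb_to_y(r, g, b) << (8 * i);
        u |= rgb_to_u(r, g, b) << (8 * i);
        v |= rgb_to_v(r, g, b) << (8 * i);
    }
    /* One 32-bit store per plane instead of four 8-bit stores. */
    *y4 = y;
    *u4 = u;
    *v4 = v;
}
```

With these formulas a black pixel gives Y = 16 and a white pixel gives Y = 235, so no clamping is needed for 8-bit inputs. In an actual kernel the three stores would go to global memory buffers, which is where the 32-bit access pays off.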
Oh.. and the proper code to invoke the kernel
Note: have a look at the work item calculations. Some additional code needs to be added (e.g. using mod to add enough spare items) to make sure the global work size is a multiple of the local work size.
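The padding mentioned in the note could look something like the following C sketch on the host side. Both helper names are hypothetical, and the assumption that each work item handles 4 pixels is taken from the question rather than from the answer's kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work items needed when each one
   handles 4 pixels (rounds up for widths not divisible by 4). */
static size_t items_for_pixels(size_t pixels) {
    return (pixels + 3) / 4;
}

/* Hypothetical helper: pads the item count with spare items so that
   it becomes a multiple of the local work size. */
static size_t round_up_to_multiple(size_t items, size_t local) {
    assert(local > 0);
    size_t rem = items % local;
    return rem == 0 ? items : items + (local - rem);
}
```

For example, 1920 pixels need 480 work items, which a local size of 64 pads to 512. The kernel then has to compare its global id against the real item count and return early for the spare items, otherwise they would read and write out of bounds.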
Like this? Use int4 unless your platform can use int3. You can also pack 5 pixels into an int16, so you waste 1/16 instead of 1/4 of the memory bandwidth.
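The bandwidth figures come from simple lane counting: one RGB pixel uses 3 of the 4 lanes of an int4, while 5 RGB pixels use 15 of the 16 lanes of an int16. A small C sketch of that layout, using a plain array to stand in for the OpenCL vector; the function names and the interleaved R, G, B lane order are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

enum { CHANNELS = 3, PIXELS = 5, LANES = 16 };

/* Lanes left unused when `pixels` RGB pixels share a `lanes`-wide vector. */
static int wasted_lanes(int lanes, int pixels) {
    assert(pixels * CHANNELS <= lanes);
    return lanes - pixels * CHANNELS;
}

/* Hypothetical layout: 5 interleaved RGB pixels (15 bytes) widened into
   16 integer lanes, with the last lane left as zero padding. */
static void pack5(const uint8_t rgb[PIXELS * CHANNELS], int32_t lanes[LANES]) {
    for (int i = 0; i < PIXELS * CHANNELS; ++i) {
        lanes[i] = rgb[i];
    }
    lanes[LANES - 1] = 0; /* the single wasted lane */
}
```

Both layouts waste exactly one lane; the difference is that it is one lane out of 4 in the first case and one out of 16 in the second.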
According to the OpenCL specification, the data type int3 doesn't exist.
Page 123:
In your kernel, the variables rgb, R, G, B, and yuv should be at least __private int4.
OpenCL 1.1 added support for typen where n = 3. However, I strongly recommend you don't use it. Different vendor implementations have different bugs, and it's not saving you anything.