Fast RGB565 to YUV (or even RGB565 to Y)
I'm working on something where I want an output option that goes to a video overlay. Some overlays support RGB565; if so, sweet, just copy the data across.
If not I have to copy data across with a conversion and it's a frame buffer at a time. I'm going to try a few things, but I thought this might be one of those things that optimisers would be keen on having a go at for a bit of a challenge.
There are a variety of commonly supported YUV formats; the easiest would be a plane of Y followed by either an interleaved UV plane or individual U and V planes.
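For reference, both layouts come down to a handful of byte offsets. This is only a minimal sketch, assuming 4:2:0 subsampling, even dimensions and no row padding (a real Xv image may report its own pitches and offsets, which should take precedence). `yuv420_layout` and `yuv420_layout_for` are hypothetical names made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper type: where each plane starts inside one
   contiguous 4:2:0 buffer (assumed: even width/height, no row padding). */
typedef struct {
    size_t y_offset;     /* start of the full-resolution Y plane */
    size_t u_offset;     /* first U sample */
    size_t v_offset;     /* first V sample */
    size_t chroma_step;  /* bytes between consecutive U samples */
    size_t total_size;   /* bytes needed for the whole frame */
} yuv420_layout;

static yuv420_layout yuv420_layout_for(size_t width, size_t height,
                                       int interleaved_uv)
{
    yuv420_layout l;
    size_t y_size = width * height;              /* one byte per pixel */
    size_t c_size = (width / 2) * (height / 2);  /* one chroma plane */

    l.y_offset = 0;
    l.u_offset = y_size;
    /* Interleaved: U,V,U,V,... so V starts one byte after U.
       Separate planes: V starts after the whole U plane. */
    l.v_offset = interleaved_uv ? y_size + 1 : y_size + c_size;
    l.chroma_step = interleaved_uv ? 2 : 1;
    l.total_size = y_size + 2 * c_size;
    return l;
}
```

Either way the Y plane comes first and is the same size, so a Y-only converter can be written once and the chroma handled separately.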
Using Linux / xv, but at the level I'm dealing with it's just bytes and an x86.
I'm going to focus on speed at the cost of quality, but there are potentially hundreds of different paths to try out. There's a balance in there somewhere.
I looked at MMX, but I'm not sure there is anything useful there. Nothing strikes me as particularly suited to the task, and it takes a lot of shuffling to get things into the right place in registers.
I'm trying a crude version with Y = G*0.5 + R*0.25 + B*(not much). The U and V are even less of a concern quality-wise; you can get away with murder on those channels.
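In C, one way to get those weights with nothing but shifts might look like the sketch below. The function names are made up for illustration. Note that the result tops out at 189 rather than 255, which is part of the "crude" trade-off.

```c
#include <stddef.h>
#include <stdint.h>

/* Crude luma for one RGB565 pixel (RRRRRGGG GGGBBBBB):
   roughly G/2 + R/4 on an 8-bit scale, with a sliver of blue. */
static inline uint8_t rgb565_to_crude_y(uint16_t p)
{
    uint8_t lo = (uint8_t)(p >> 3);   /* GGGGGGBB: green plus top 2 blue bits */
    uint8_t hi = (uint8_t)(p >> 11);  /* 000RRRRR */

    /* lo >> 1 is about G/2, hi << 1 is about R/4.
       Maximum is 127 + 62 = 189, so the sum cannot overflow a byte. */
    return (uint8_t)((lo >> 1) + (hi << 1));
}

/* Convert `count` pixels from src into the Y plane at dst. */
void rgb565_to_y(const uint16_t *src, uint8_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = rgb565_to_crude_y(src[i]);
}
```

This also gives a baseline for seeing what a compiler makes of the plain loop at various optimisation levels.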
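A simple loop:
    next_pixel:
    movzx eax, word ptr [esi]   ; load one RGB565 pixel: RRRRRGGG GGGBBBBB
    add   esi, 2
    shr   eax, 3                ; al = GGGGGGBB, ah = 000RRRRR
    shr   al, 1                 ; al ~ G/2 on an 8-bit scale
    add   ah, ah                ; ah ~ R/4 on an 8-bit scale
    add   al, ah                ; Y = G/2 + R/4 (max 189, no overflow)
    mov   [edi], al
    add   edi, 1
    dec   count                 ; count = pixels remaining
    jnz   next_pixel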
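Of course, every instruction depends on the one before it, and word reads aren't the best, so interleaving two pixels might gain a bit:
    next_pair:
    mov   eax, [esi]            ; load two RGB565 pixels
    add   esi, 4
    mov   ebx, eax
    shr   ebx, 19               ; pixel 1: bl = GGGGGGBB, bh = 000RRRRR
    movzx eax, ax               ; drop pixel 1 so its low bits cannot leak into ah
    shr   eax, 3                ; pixel 0: al = GGGGGGBB, ah = 000RRRRR
    shr   al, 1                 ; G/2
    shr   bl, 1
    add   ah, ah                ; R/4
    add   bh, bh
    add   al, ah                ; Y0
    add   bl, bh                ; Y1
    mov   ah, bl                ; pack Y0, Y1 into ax
    mov   [edi], ax
    add   edi, 2
    dec   count                 ; count = pixel pairs remaining
    jnz   next_pair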
It would be quite easy to do the same with 4 pixels at a time, which may or may not be a benefit.
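A 4-at-a-time version can also be sketched in portable C by treating a 64-bit word as four 16-bit lanes. This is an illustration only: it assumes a little-endian machine (as on x86), the function name is made up, and whether it beats the simple loop would need measuring.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Crude Y = G/2 + R/4 for four RGB565 pixels per 64-bit load.
   Assumes little-endian byte order so that lane 0 is the first pixel. */
void rgb565_to_y_x4(const uint16_t *src, uint8_t *dst, size_t count)
{
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        uint64_t w;
        memcpy(&w, src + i, sizeof w);   /* unaligned-safe load of 4 pixels */

        /* Per lane: bits 4..10 give G/2, bits 11..15 shifted give R/4.
           Each lane sum is at most 189, so nothing carries between lanes. */
        uint64_t y = ((w >> 4)  & UINT64_C(0x007F007F007F007F))
                   + ((w >> 10) & UINT64_C(0x003E003E003E003E));

        /* Squeeze the low byte of each 16-bit lane into 4 adjacent bytes. */
        y = (y | (y >> 8)) & UINT64_C(0x0000FFFF0000FFFF);
        uint32_t packed = (uint32_t)(y | (y >> 16));

        memcpy(dst + i, &packed, sizeof packed);
    }

    /* Leftover pixels, one at a time, using the same arithmetic. */
    for (; i < count; i++) {
        uint16_t p = src[i];
        dst[i] = (uint8_t)(((p >> 4) & 0x7F) + ((p >> 10) & 0x3E));
    }
}
```

Working on whole lanes with masks avoids the byte-register juggling entirely, at the cost of a few extra shifts for the final pack.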
Can anyone come up with anything faster, better?
An interesting side point to this is whether or not a decent compiler can produce similar code.
Comments (2)
A decent compiler, given the appropriate switches to tune for the CPU variants of most interest, almost certainly knows a lot more about good x86 instruction selection and scheduling than any mere mortal!
Take a look at the Intel(R) 64 and IA-32 Architectures Optimization Reference Manual...
If you want to get into hand-optimising code, a good strategy might be to get the compiler to generate assembly source for you as a starting point, and then tweak that; profile before and after every change to ensure that you're actually making things better.
What you really want to look at, I think, is using MMX or the integer SSE instructions for this. That will let you work with a few pixels at a time. I imagine your compiler will be able to generate such code if you specify the correct switches, especially if your code is written nicely enough.
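As a rough idea of what the integer SSE route could look like, here is a sketch using SSE2 intrinsics that handles 16 pixels per iteration. It assumes the crude Y = G/2 + R/4 weighting from the question and an SSE2-capable x86 target; the function name is made up, and it has not been benchmarked against the scalar loops.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Crude luma (G/2 + R/4) for RGB565, 16 pixels per iteration. */
void rgb565_to_y_sse2(const uint16_t *src, uint8_t *dst, size_t count)
{
    const __m128i g_mask = _mm_set1_epi16(0x007F);  /* G/2 after >> 4  */
    const __m128i r_mask = _mm_set1_epi16(0x003E);  /* R/4 after >> 10 */
    size_t i = 0;

    for (; i + 16 <= count; i += 16) {
        /* Two unaligned loads of 8 pixels each. */
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 8));

        __m128i ya = _mm_add_epi16(
            _mm_and_si128(_mm_srli_epi16(a, 4), g_mask),
            _mm_and_si128(_mm_srli_epi16(a, 10), r_mask));
        __m128i yb = _mm_add_epi16(
            _mm_and_si128(_mm_srli_epi16(b, 4), g_mask),
            _mm_and_si128(_mm_srli_epi16(b, 10), r_mask));

        /* Narrow 16 words to 16 bytes; values are at most 189,
           so the unsigned saturation never actually clips. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(ya, yb));
    }

    /* Leftover pixels, one at a time. */
    for (; i < count; i++) {
        uint16_t p = src[i];
        dst[i] = (uint8_t)(((p >> 4) & 0x7F) + ((p >> 10) & 0x3E));
    }
}
```

No shuffling is needed for the Y-only case, because the per-pixel work is just two shifts, two masks and an add, followed by a single pack.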
Regarding your existing code, I wouldn't bother with interleaving instructions of different iterations to gain performance. The out-of-order engine of all x86 processors (excluding Atom) and the caches should handle that pretty well.
Edit: If you need to do horizontal adds, you might want to use the PHADDD and PHADDW instructions. In fact, if you have the Intel Software Developer's Manual, you should look for the PH* instructions. They might have what you need.