通过翻译复制内存的快速方法 - ARGB 到 BGR

发布于 2024-11-26 13:05:53 字数 1194 浏览 12 评论 0原文

概述

我有一个图像缓冲区，需要将其转换为另一种格式。原始图像缓冲区有四个通道，每个通道 8 位，Alpha、Red、Green 和 Blue。目标缓冲区是三个通道，每个通道 8 位，蓝色、绿色和红色。

所以暴力方法是：

// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

ARGB orig[IMAGESIZE];
BGR  dest[IMAGESIZE];

for(x = 0; x < IMAGESIZE; x++)
{
     dest[x].Red = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue = orig[x].Blue;
}

但是，我需要比循环和三字节副本提供的速度更快的速度。鉴于我在 32 位机器上运行，我希望可以使用一些技巧来减少内存读写次数。

附加信息

每个图像至少是 4 像素的倍数。因此，我们可以对 16 个 ARGB 字节进行寻址，并将它们移动到每个循环的 12 个 RGB 字节中。也许这个事实可以用来加快速度，特别是当它很好地落入 32 位边界时。

我可以访问 OpenCL - 虽然这需要将整个缓冲区移动到 GPU 内存中，然后将结果移回，但事实上 OpenCL 可以同时处理图像的许多部分，而且大内存块移动实际上是相当有效可能会使这成为一次值得探索的事情。

虽然我在上面给出了小缓冲区的示例，但我实际上正在移动高清视频 (1920x1080)，有时会移动更大（大多数情况下更小）的缓冲区，因此虽然 32x32 的情况可能微不足道，但逐字节复制 8.3MB 的图像数据是真的非常糟糕。

在 Intel 处理器（Core 2 及更高版本）上运行，因此我知道存在流和数据处理命令，但不知道 - 也许在哪里寻找专门的数据处理指令的指针会很好。

这是一个 OS X 应用程序，我使用的是 XCode 4。如果组装是轻松的并且是显而易见的方法，那么我很好沿着这条路走，但之前没有在这个设置上完成过它让我警惕投入太多时间。

伪代码很好 - 我不是在寻找完整的解决方案，只是在寻找算法和对任何可能无法立即清楚的技巧的解释。

原文

Overview

I have an image buffer that I need to convert to another format. The origin image buffer is four channels, 8 bits per channel, Alpha, Red, Green, and Blue. The destination buffer is three channels, 8 bits per channel, Blue, Green, and Red.

So the brute force method is:

// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

ARGB orig[IMAGESIZE];
BGR  dest[IMAGESIZE];

for(x = 0; x < IMAGESIZE; x++)
{
     dest[x].Red = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue = orig[x].Blue;
}

However, I need more speed than is provided by a loop and three byte copies. I'm hoping there might be a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32 bit machine.

Additional info

Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely into 32 bit boundaries.

I have access to OpenCL - and while that requires moving the entire buffer into the GPU memory, then moving the result back out, the fact that OpenCL can work on many portions of the image simultaneously, and the fact that large memory block moves are actually quite efficient may make this a worthwhile exploration.

While I've given the example of small buffers above, I really am moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 situation may be trivial, copying 8.3MB of image data byte by byte is really, really bad.

Running on Intel processors (Core 2 and above) and thus there are streaming and data processing commands I'm aware exist, but don't know about - perhaps pointers on where to look for specialized data handling instructions would be good.

This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.

Pseudo-code is fine - I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

痴梦一场 2024-12-03 13:05:53

我写了 4 个不同的版本，它们通过交换字节来工作。我使用 gcc 4.2.1 和 -O3 -mssse3 编译它们，在超过 32MB 的随机数据上运行它们 10 次，并找到平均值。

编者注：原始的内联汇编使用了不安全的约束，例如修改仅输入操作数，并且不告诉编译器寄存器中的指针输入所指向的内存的副作用。显然这对于基准测试来说效果很好。我修复了限制，以确保所有呼叫者的安全。这不应该影响基准数据，只是确保周围的代码对所有调用者都是安全的。具有更高内存带宽的现代 CPU 应该会看到 SIMD 比一次 4 字节标量有更大的加速，但最大的好处是当数据在缓存中很热时（在较小的块中工作，或在较小的总大小上工作）。

2020 年，最好的选择是使用可移植的 _mm_loadu_si128 内在函数版本，该版本将编译为等效的 asm 循环：https://gcc.gnu.org/wiki/DontUseInlineAsm。

另请注意，所有这些都会覆盖输出末尾之后的 1 个（标量）或 4 个（SIMD）字节，因此如果出现问题，请单独执行最后 3 个字节。

--- @PeterCordes

第一个版本使用 C 循环使用 OSSwapInt32 函数（使用 -O3< 编译为 bswap 指令）单独转换每个像素/代码>）。

void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB.  Use memcpy for unaligned loads/stores
    }
}

第二种方法执行相同的操作，但使用内联汇编循环而不是 C 循环。

void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
        "0:\n\t"                   // do {
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"    // copy a dword byte-reversed
        "add    $4,%1\n\t"         // orig += 4 bytes
        "add    $3,%0\n\t"         // dest += 3 bytes
        "dec    %2\n\t"
        "jnz    0b"                // }while(--imageSize)
        : "+r" (dest), "+r" (orig), "+r" (imageSize)
        : // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
        : "flags", "eax", "memory"
    );
}

第三个版本是只是一个装腔作势的答案。我将内置函数转换为 GCC 等效函数，并使用 lddqu 内置函数，以便输入参数不需要对齐。（编者注：只有 P4 曾从 lddqu 中受益；使用 movdqu 很好，但没有缺点。）

typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}

最后，第四个版本是与第三个版本等效的内联汇编。

void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    asm volatile (
        "lddqu  %3,%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "m" (mask)  // whole array as a memory operand.  "x" would get the compiler to load it
        : "flags", "xmm0", "xmm1", "memory"
    );
}

（这些都使用 GCC9.3 可以正常编译，但 clang10 不知道 __builtin_ia32_pshufb128；使用 _mm_shuffle_epi8。）

在我的 2010 MacBook Pro、2.4 Ghz i5 (Westmere/Arrandale)、4GB RAM 上，这些是每个的平均时间：

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds

如您所见，编译器在优化方面足够好你不需要编写汇编。此外，矢量函数在 32MB 数据上仅快了 1.5 毫秒，因此如果您想支持最早的 Intel Mac（不支持 SSSE3），不会造成太大损害。

编辑：liori 要求提供标准差信息。不幸的是，我没有保存数据点，所以我又进行了 25 次迭代的测试。

              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)

另外，这是新测试的原始数据，以防有人需要。对于每次迭代，随机生成一个 32MB 的数据集并运行四个函数。下面列出了每个函数的运行时间（以微秒为单位）。

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1:   10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2:   10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3:    9036  9619  9341  8970  9453  9758  9043 10114  9243  9027  9163  9176  9168  9122  9514  9049  9161  9086  9064  9604  9178  9233  9301  9717  9156
Version 4:    9339 10119  9846  9217  9526  9182  9145 10286  9051  9614  9249  9653  9799  9270  9173  9103  9132  9550  9147  9157  9199  9113  9699  9354  9314

I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data and found the averages.

Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on memory pointed-to by pointer inputs in registers. Apparently this worked ok for the benchmark. I fixed the constraints to be properly safe for all callers. This should not affect benchmark numbers, only make sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-byte-at-a-time scalar, but the biggest benefits are when data is hot in cache (work in smaller blocks, or on smaller total sizes).

In 2020, your best bet is to use the portable _mm_loadu_si128 intrinsics version that will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.

Also note that all of these over-write 1 (scalar) or 4 (SIMD) bytes past the end of the output, so do the last 3 bytes separately if that's a problem.

--- @PeterCordes

The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).

void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB.  Use memcpy for unaligned loads/stores
    }
}

The second method performs the same operation, but uses an inline assembly loop instead of a C loop.

void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
        "0:\n\t"                   // do {
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"    // copy a dword byte-reversed
        "add    $4,%1\n\t"         // orig += 4 bytes
        "add    $3,%0\n\t"         // dest += 3 bytes
        "dec    %2\n\t"
        "jnz    0b"                // }while(--imageSize)
        : "+r" (dest), "+r" (orig), "+r" (imageSize)
        : // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
        : "flags", "eax", "memory"
    );
}

The third version is a modified version of just a poseur's answer. I converted the built-in functions to the GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only P4 ever benefited from lddqu; it's fine to use movdqu but there's no downside.)

typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}

Finally, the fourth version is the inline assembly equivalent of the third.

void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    asm volatile (
        "lddqu  %3,%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "m" (mask)  // whole array as a memory operand.  "x" would get the compiler to load it
        : "flags", "xmm0", "xmm1", "memory"
    );
}

(These all compile fine with GCC9.3, but clang10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8.)

On my 2010 MacBook Pro, 2.4 Ghz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds

As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel macs, which didn't support SSSE3.

Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions. The runtime of each function in microseconds is listed below.

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1:   10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2:   10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3:    9036  9619  9341  8970  9453  9758  9043 10114  9243  9027  9163  9176  9168  9122  9514  9049  9161  9086  9064  9604  9178  9233  9301  9717  9156
Version 4:    9339 10119  9846  9217  9526  9182  9145 10286  9051  9614  9249  9653  9799  9270  9173  9103  9132  9550  9147  9157  9199  9113  9699  9354  9314

回复收藏 0 原文

国粹 2024-12-03 13:05:53

显而易见，使用 pshufb。

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}

The obvious, using pshufb.

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}

回复收藏 0 原文

二手情话 2024-12-03 13:05:53

仅结合poseur和Jitamar的答案，如果假设输入和输出是16字节对齐的，并且一次处理像素4，则可以使用shuffle、mask、ands和ors的组合来使用aligned存储商店。主要思想是生成四个中间数据集，然后将它们与掩码一起选择相关像素值并写出3个16字节的像素数据集。请注意，我根本没有编译它或尝试运行它。

EDIT2：有关底层代码结构的更多详细信息：

使用 SSE2，您可以通过 16 字节对齐读取和 16 字节写入获得更好的性能。由于 3 字节像素每 16 个像素只能对齐到 16 字节，因此我们使用随机播放和掩码以及一次 16 个输入像素的组合来一次批量处理 16 个像素。

从 LSB 到 MSB，输入看起来像这样，忽略特定组件：

s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333

输出看起来像这样：

d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333

因此，要生成这些输出，您需要执行以下操作（稍后我将指定实际的转换）：

d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_1_low(s[2]), f_1_high(s[3]))

现在，应该 < code>combine_是什么样子的？如果我们假设 d 只是将 s 压缩在一起，我们可以用掩码和 or 连接两个 s：

combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

其中（1 表示选择左侧像素，0表示选择右侧像素）：
掩码(0)= 111 111 111 111 000 0
掩码(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000

但实际的转换（f__low, f__high）实际上并不那么简单。由于我们正在反转并从源像素中删除字节，因此实际的转换是（为了简洁起见，对于第一个目标）：

d[0]= 
    s[0][0].Blue s[0][0].Green s[0][0].Red 
    s[0][1].Blue s[0][1].Green s[0][1].Red 
    s[0][2].Blue s[0][2].Green s[0][2].Red 
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue

如果将上述内容转换为从源到目标的字节偏移量，您将得到：
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7

（如果你查看所有 s[0] 偏移量，它们仅以相反的顺序匹配poseur的随机播放掩码。）

现在，我们可以生成一个随机播放掩码以将每个源字节映射到目标字节（X 表示我们不关心该值是什么）：

f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7

f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10

f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13

我们可以通过查看用于每个源像素的掩码来进一步优化它。如果您看一下我们用于 s[1] 的随机播放蒙版：

f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X

由于两个随机播放蒙版不重叠，我们可以将它们组合起来，并简单地屏蔽掉 merge_ 中不相关的像素，我们已经这样做了！以下代码执行所有这些优化（此外，它还假设源地址和目标地址是 16 字节对齐的）。此外，掩码以 MSB->LSB 顺序以代码形式写出，以防您对顺序感到困惑。

编辑：将存储更改为 _mm_stream_si128 因为您可能会进行大量写入，而我们不一定要刷新缓存。另外，无论如何它都应该对齐，这样你就可以获得免费的性能！

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 16 == 0);

    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel

    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel

    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to third pixel

    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
        -128, -128, -128, -128); // unused

    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);

    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(b, mask0));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(c, mask1));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(d, mask2));
    }
}

Combining just a poseur's and Jitamaro's answers, if you assume that the inputs and outputs are 16-byte aligned and if you process pixels 4 at a time, you can use a combination of shuffles, masks, ands, and ors to store out using aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.

EDIT2: More detail about the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3 byte pixel is only alignable to 16-bytes for every 16 pixels, we batch up 16 pixels at a time using a combination of shuffles and masks and ors of 16 input pixels at a time.

From LSB to MSB, the inputs look like this, ignoring the specific components:

s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333

and the ouptuts look like this:

d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333

So to generate those outputs, you need to do the following (I will specify the actual transformations later):

d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_1_low(s[2]), f_1_high(s[3]))

Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or:

combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)= 11 111 111 000 000 00
mask(2)= 1 111 000 000 000 000

But the actual transformations (f_<x>_low, f_<x>_high) are actually not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination for brevity):

d[0]= 
    s[0][0].Blue s[0][0].Green s[0][0].Red 
    s[0][1].Blue s[0][1].Green s[0][1].Red 
    s[0][2].Blue s[0][2].Green s[0][2].Red 
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue

If you translate the above into byte offsets from source to dest, you get:
d[0]=
&s[0]+3 &s[0]+2 &s[0]+1
&s[0]+7 &s[0]+6 &s[0]+5
&s[0]+11 &s[0]+10 &s[0]+9
&s[0]+15 &s[0]+14 &s[0]+13
&s[1]+3 &s[1]+2 &s[1]+1
&s[1]+7

(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)

Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):

f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7

f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10

f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13

We can further optimize this by looking the masks we use for each source pixel. If you take a look at the shuffle masks that we use for s[1]:

f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X

Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant pixels in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.

EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't want to necessarily flush the cache. Plus it should be aligned anyway so you get free perf!

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 16 == 0);

    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel

    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel

    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to third pixel

    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
        -128, -128, -128, -128); // unused

    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);

    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(b, mask0));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(c, mask1));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(d, mask2));
    }
}

回复收藏 0 原文

装纯掩盖桑 2024-12-03 13:05:53

我来晚了一点，似乎社区已经决定使用poseur的 pshufb-answer，但分配了2000个声誉，这非常慷慨，我必须尝试一下。

这是我的版本，没有特定于平台的内在函数或特定于机器的汇编，我包含了一些跨平台计时代码，如果您像我一样进行位操作和，则显示4x加速 > 激活编译器优化（寄存器优化、循环展开）：

#include "stdlib.h"
#include "stdio.h"
#include "time.h"

#define UInt8 unsigned char

#define IMAGESIZE (1920*1080) 
int main() {
    time_t  t0, t1;
    int frames;
    int frame; 
    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

    ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
    if(!orig) {printf("nomem1");}
    BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
    if(!dest) {printf("nomem2");}

    printf("to start original hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    for(frame = 0; frame<frames; frame++) {
        int x; for(x = 0; x < IMAGESIZE; x++) {
            dest[x].Red = orig[x].Red;
            dest[x].Green = orig[x].Green;
            dest[x].Blue = orig[x].Blue;
            x++;
        }
    }
    t1 = time(0);
    printf("finished original of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook the original took 16 sec 
    // (8 sec with compiler optimization -O3) so at 60 FPS 
    // (instead of the 1200) this would be faster than realtime 
    // (if you disregard any other rendering you have to do). 
    // However if you either want to do other/more processing 
    // OR want faster than realtime processing for e.g. a video-conversion 
    // program then this would have to be a lot faster still.

    printf("to start alternative hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    unsigned int* reader;
    unsigned int* end = reader+IMAGESIZE;
    unsigned int cur; // your question guarantees 32 bit cpu
    unsigned int next;
    unsigned int temp;
    unsigned int* writer;
    for(frame = 0; frame<frames; frame++) {
        reader = (void*)orig;
        writer = (void*)dest;
        next = *reader;
        reader++;
        while(reader<end) {
            cur = next;
            next = *reader;         
            // in the following the numbers are of course the bitmasks for 
            // 0-7 bits, 8-15 bits and 16-23 bits out of the 32
            temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255); 
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280);
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680);
            *writer = temp;
            reader++;
            writer++;
        }
    }
    t1 = time(0);
    printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook this alternative took 10 sec 
    // (4 sec with compiler optimization -O3)

}

结果如下（在我的 core 2 子笔记本上）：

F:\>gcc b.c -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds

F:\>gcc b.c -O3 -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

I am coming a little late to the party, seeming that the community has already decided for poseur's pshufb-answer but distributing 2000 reputation, that is so extremely generous i have to give it a try.

Here's my version without platform specific intrinsics or machine-specific asm, i have included some cross-platform timing code showing a 4x speedup if you do both the bit-twiddling like me AND activate compiler-optimization (register-optimization, loop-unrolling):

#include "stdlib.h"
#include "stdio.h"
#include "time.h"

#define UInt8 unsigned char

#define IMAGESIZE (1920*1080) 
int main() {
    time_t  t0, t1;
    int frames;
    int frame; 
    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

    ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
    if(!orig) {printf("nomem1");}
    BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
    if(!dest) {printf("nomem2");}

    printf("to start original hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    for(frame = 0; frame<frames; frame++) {
        int x; for(x = 0; x < IMAGESIZE; x++) {
            dest[x].Red = orig[x].Red;
            dest[x].Green = orig[x].Green;
            dest[x].Blue = orig[x].Blue;
            x++;
        }
    }
    t1 = time(0);
    printf("finished original of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook the original took 16 sec 
    // (8 sec with compiler optimization -O3) so at 60 FPS 
    // (instead of the 1200) this would be faster than realtime 
    // (if you disregard any other rendering you have to do). 
    // However if you either want to do other/more processing 
    // OR want faster than realtime processing for e.g. a video-conversion 
    // program then this would have to be a lot faster still.

    printf("to start alternative hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    unsigned int* reader;
    unsigned int* end = reader+IMAGESIZE;
    unsigned int cur; // your question guarantees 32 bit cpu
    unsigned int next;
    unsigned int temp;
    unsigned int* writer;
    for(frame = 0; frame<frames; frame++) {
        reader = (void*)orig;
        writer = (void*)dest;
        next = *reader;
        reader++;
        while(reader<end) {
            cur = next;
            next = *reader;         
            // in the following the numbers are of course the bitmasks for 
            // 0-7 bits, 8-15 bits and 16-23 bits out of the 32
            temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255); 
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280);
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680);
            *writer = temp;
            reader++;
            writer++;
        }
    }
    t1 = time(0);
    printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook this alternative took 10 sec 
    // (4 sec with compiler optimization -O3)

}

The results are these (on my core 2 subnotebook):

F:\>gcc b.c -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds

F:\>gcc b.c -O3 -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

回复收藏 0 原文

逆蝶 2024-12-03 13:05:53

您想要使用 Duff 的设备： http://en.wikipedia.org/wiki/Duff%27s_device 。它也可以在 JavaScript 中运行。然而这篇文章读起来有点有趣http://lkml.indiana。 edu/hypermail/linux/kernel/0008.2/0171.html。想象一下 Duff 设备有 512 KB 的移动量。

回复收藏 0 原文

℡寂寞咖啡 2024-12-03 13:05:53

结合此处的快速转换函数之一，如果可以访问 Core 2，则明智的做法是将转换拆分为线程，这些线程处理其数据，例如四分之一的数据，如以下伪代码所示：

void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
       thread threads[] = {
           create_thread(bgrFromArgb, dest, src, n/4),
           create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
           create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
           create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
       }
       join_threads(threads);
}

In combination with one of the fast conversion functions here, given access to Core 2s it might be wise to split the translation into threads, which work on their, say, fourth of the data, as in this psudeocode:

void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
       thread threads[] = {
           create_thread(bgrFromArgb, dest, src, n/4),
           create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
           create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
           create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
       }
       join_threads(threads);
}

回复收藏 0 原文

风追烟花雨 2024-12-03 13:05:53

这个汇编函数应该可以，但是我不知道你是否想保留旧数据，这个函数会覆盖它。

该代码适用于具有英特尔汇编风格的 MinGW GCC，您必须对其进行修改以适合您的编译器/汇编器。

extern "C" {
    int convertARGBtoBGR(uint buffer, uint size);
    __asm(
        ".globl _convertARGBtoBGR\n"
        "_convertARGBtoBGR:\n"
        "  push ebp\n"
        "  mov ebp, esp\n"
        "  sub esp, 4\n"
        "  mov esi, [ebp + 8]\n"
        "  mov edi, esi\n"
        "  mov ecx, [ebp + 12]\n"
        "  cld\n"
        "  convertARGBtoBGR_loop:\n"
        "    lodsd          ; load value from [esi] (4byte) to eax, increment esi by 4\n"
        "    bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
        "    stosd          ; store 4 bytes to [edi], increment  edi by 4\n"
        "    sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
        "    loop convertARGBtoBGR_loop\n"
        "  leave\n"
        "  ret\n"
    );
}

你应该这样称呼它：

convertARGBtoBGR( &buffer, IMAGESIZE );

与你的暴力方法相比，这个函数每个像素/数据包仅访问内存两次（1次读取，1次写入）（至少/假设它被编译为注册） ) 3 次读取和 3 次写入操作。方法是相同的，但实施使其更加有效。

This assembly function should do, however I don't know if you would like to keep old data or not, this function overrides it.

The code is for MinGW GCC with intel assembly flavour, you will have to modify it to suit your compiler/assembler.

extern "C" {
    int convertARGBtoBGR(uint buffer, uint size);
    __asm(
        ".globl _convertARGBtoBGR\n"
        "_convertARGBtoBGR:\n"
        "  push ebp\n"
        "  mov ebp, esp\n"
        "  sub esp, 4\n"
        "  mov esi, [ebp + 8]\n"
        "  mov edi, esi\n"
        "  mov ecx, [ebp + 12]\n"
        "  cld\n"
        "  convertARGBtoBGR_loop:\n"
        "    lodsd          ; load value from [esi] (4byte) to eax, increment esi by 4\n"
        "    bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
        "    stosd          ; store 4 bytes to [edi], increment  edi by 4\n"
        "    sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
        "    loop convertARGBtoBGR_loop\n"
        "  leave\n"
        "  ret\n"
    );
}

You should call it like so:

convertARGBtoBGR( &buffer, IMAGESIZE );

This function is accessing memory only twice per pixel/packet (1 read, 1 write) comparing to your brute force method that had (at least / assuming it was compiled to register) 3 read and 3 write operations. Method is the same but implementation makes it more efficent.

回复收藏 0 原文

柒七 2024-12-03 13:05:53

您可以以 4 个像素为单位进行操作，使用无符号长指针移动 32 位。试想一下，使用 4 个 32 位像素，您可以通过移位和 OR/AND 来构造，3 个字代表 4 个 24 位像素，如下所示：

//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR  32 bits writing (4 pixels)

在所有现代 32/64 位处理器中，移位操作始终由 1 个指令周期完成（桶移位技术）所以这是构造这 3 个单词进行写入的最快方法，按位 AND 和 OR 也非常快。

像这样：

//assuming we have 4 ARGB1 ... ARGB4 pixels and  3 32 bits words,  W1, W2 and W3 to write
// and *dest  its an unsigned long pointer for destination
W1 = ((ARGB1 & 0x000f) << 24) | ((ARGB1 & 0x00f0) << 8) | ((ARGB1 & 0x0f00) >> 8) | (ARGB2 & 0x000f);
*dest++ = W1;

依此类推......循环中的下一个像素。

您需要对不是 4 的倍数的图像进行一些调整，但我敢打赌这是最快的方法，无需使用汇编程序。

顺便说一句，忘记使用结构和索引访问，这些是移动数据的慢方式，只需看一下已编译的 C++ 程序的反汇编列表，您就会同意与我一起。

You can do it in chunks of 4 pixels, moving 32 bits with unsigned long pointers. Just think that with 4 32 bits pixels you can construct by shifting and OR/AND, 3 words representing 4 24bits pixels, like this:

//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR  32 bits writing (4 pixels)

Shifting operations are always done by 1 instruction cycle in all modern 32/64 bits processors (barrel shifting technique) so its the fastest way of constructing those 3 words for writing, bitwise AND and OR are also blazing fast.

Like this:

//assuming we have 4 ARGB1 ... ARGB4 pixels and  3 32 bits words,  W1, W2 and W3 to write
// and *dest  its an unsigned long pointer for destination
W1 = ((ARGB1 & 0x000f) << 24) | ((ARGB1 & 0x00f0) << 8) | ((ARGB1 & 0x0f00) >> 8) | (ARGB2 & 0x000f);
*dest++ = W1;

and so on.... with next pixels in a loop.

You'll need some adjusting with images that are not multiple of 4, but I bet this is the fastest approach of all, without using assembler.

And btw, forget about using structs and indexed access, those are the SLOWER ways of all for moving data, just take a look at a disassembly listing of a compiled C++ program and you'll agree with me.

回复收藏 0 原文

安稳善良 2024-12-03 13:05:53

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

除了程序集或编译器内在函数之外，我可能会尝试执行以下操作，同时非常仔细地验证最终行为，因为其中一些（涉及联合）可能依赖于编译器实现：

union uARGB
{
   struct ARGB argb;
   UInt32 x;
};
union uBGRA
{
   struct 
   {
     BGR bgr;
     UInt8 Alpha;
   } bgra;
   UInt32 x;
};

然后您的代码内核，无论循环展开是否合适：

inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
    uARGB* puargb = (uARGB*)pargb;
    uBGRA ubgra;
    ubgra.x = __byte_reverse_32(pargb->x);
    *pbgr = ubgra.bgra.bgr;
}

其中 __byte_reverse_32() 假设存在反转 32 位字字节的编译器内部函数。

总结一下基本方法：

将 ARGB 结构视为 32 位整数
反转 32 位整数
将反转的 32 位整数视为 (BGR)A 结构
让编译器复制 (BGR)A 的 (BGR) 部分结构

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

Aside from assembly or compiler intrinsics, I might try doing the following, while very carefully verifying the end behavior, as some of it (where unions are concerned) is likely to be compiler implementation dependent:

union uARGB
{
   struct ARGB argb;
   UInt32 x;
};
union uBGRA
{
   struct 
   {
     BGR bgr;
     UInt8 Alpha;
   } bgra;
   UInt32 x;
};

and then for your code kernel, with whatever loop unrolling is appropriate:

inline void argb2bgr(BGR* pbgr, ARGB* pargb)
{
    uARGB* puargb = (uARGB*)pargb;
    uBGRA ubgra;
    ubgra.x = __byte_reverse_32(pargb->x);
    *pbgr = ubgra.bgra.bgr;
}

where __byte_reverse_32() assumes the existence of a compiler intrinsic that reverses the bytes of a 32-bit word.

To summarize the underlying approach:

view ARGB structure as a 32-bit integer
reverse the 32-bit integer
view the reversed 32-bit integer as a (BGR)A structure
let the compiler copy the (BGR) portion of the (BGR)A structure

回复收藏 0 原文

溺ぐ爱和你が 2024-12-03 13:05:53

虽然您可以根据CPU使用情况使用一些技巧，

This kind of operations can be done fasted with GPU.

但似乎您使用C/C++...因此您GPU编程的替代方案可能是（在Windows平台上）

DirectCompute（DirectX 11）观看此视频
Microsoft 研究项目加速器检查此链接
Cuda
“google”GPU编程...

很快就可以使用GPU进行此类数组运算以进行更快的计算。它们就是为此而设计的。

Although you can use some tricks based on CPU usage,

This kind of operations can be done fasted with GPU.

It seems that you use C/ C++... So your alternatives for GPU programming may be ( on windows platform )

DirectCompute ( DirectX 11 ) See this video
Microsoft Research Project Accelerator Check this link
Cuda
"google" GPU programming ...

Shortly use GPU for this kind of array operations for make faster calculations. They are designed for it.

回复收藏 0 原文

何以笙箫默 2024-12-03 13:05:53

我还没有看到有人展示如何在 GPU 上执行此操作的示例。

不久前我写了一些与你的问题类似的东西。我从 video4linux2 相机接收到 YUV 格式的数据，并希望将其绘制为屏幕上的灰度级（仅 Y 分量）。我还想用蓝色绘制太暗的区域，用红色绘制过饱和的区域。

我从 freeglut分布。

数据作为 YUV 复制到纹理中，然后应用以下 GLSL 着色器程序。我确信 GLSL 代码现在可以在所有 Mac 上运行，并且它将比所有 CPU 方法快得多。

请注意，我对如何取回数据没有经验。理论上 glReadPixels 应该读回数据，但我从未测量过它的性能。

OpenCL 可能是更简单的方法，但只有当我有支持它的笔记本时，我才会开始为此进行开发。

(defparameter *vertex-shader*
"void main(){
    gl_Position    = gl_ModelViewProjectionMatrix * gl_Vertex;
    gl_FrontColor  = gl_Color;
    gl_TexCoord[0] = gl_MultiTexCoord0;
}
")

(progn
 (defparameter *fragment-shader*
   "uniform sampler2D textureImage;
void main()
{
  vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
  float v=q.z;
  if(int(gl_FragCoord.x)%2 == 0)
     v=q.x; 
  float x=0; // 1./255.;
  v-=.278431;
  v*=1.7;
  if(v>=(1.0-x))
    gl_FragColor = vec4(255,0,0,255);
  else if (v<=x)
    gl_FragColor = vec4(0,0,255,255);
  else
    gl_FragColor = vec4(v,v,v,255); 
}
")

在此处输入图像描述

I haven't seen anyone showing an example of how to do it on the GPU.

A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.

I started out with the smooth_opengl3.c example from the freeglut distribution.

The data is copied as YUV into the texture and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all macs nowadays and it will be significantly faster than all the CPU approaches.

Note that I have no experience on how you get the data back. In theory glReadPixels should read the data back but I never measured its performance.

OpenCL might be the easier approach, but then I will only start developing for that when I have a notebook that supports it.

(defparameter *vertex-shader*
"void main(){
    gl_Position    = gl_ModelViewProjectionMatrix * gl_Vertex;
    gl_FrontColor  = gl_Color;
    gl_TexCoord[0] = gl_MultiTexCoord0;
}
")

(progn
 (defparameter *fragment-shader*
   "uniform sampler2D textureImage;
void main()
{
  vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
  float v=q.z;
  if(int(gl_FragCoord.x)%2 == 0)
     v=q.x; 
  float x=0; // 1./255.;
  v-=.278431;
  v*=1.7;
  if(v>=(1.0-x))
    gl_FragColor = vec4(255,0,0,255);
  else if (v<=x)
    gl_FragColor = vec4(0,0,255,255);
  else
    gl_FragColor = vec4(v,v,v,255); 
}
")

enter image description here

回复收藏 0 原文

~没有更多了~