Fastest way to copy memory with a stride in C?

Posted on 2024-11-17 13:23:52

I'm trying to copy 1 or 2 colour channels from RGBA image data as quickly as possible (this is the slowest part of my code, and it's slowing the whole app down). Is there a fast way of copying with stride?

The data is simply laid out as RGBARGBARGBA etc., and I need to copy just the R values, or in another case just the RG values.

What I have so far is roughly this to copy the R values:

for(int i=0; i<dataSize; i++){
    dest[i] = source[i*4];
}

For the RG values, I'm doing:

for(int i=0; i<dataSize; i+=2){
    dest[i] = source[i*2];
    dest[i+1] = source[(i*2)+1];
}

All the data is unsigned 1-byte values. Is there a faster way? I've already partially unrolled the loop (doing 64 values per iteration; the speedup beyond that is insignificant). The platform is Armv7 (iOS), so NEON (SIMD) might be useful, but unfortunately I have zero experience with it.

Changing the data layout is unfortunately out of the question: it's provided by OpenGL's readPixels() function, and as far as I've been able to tell iOS doesn't support reading it back as L, LA, RG etc.

不可一世的女人 2024-11-24 13:23:52

If you're OK with iOS 4 and above, you might find vDSP and the Accelerate framework useful. Check out the documentation for all sorts of image manipulation goodness at warp speed.

#import <Accelerate/Accelerate.h>

I don't know what you do next, but if you're doing any form of calculation on the image data and want it in floating-point form, you can use vDSP_vfltu8 to convert one channel of the source byte data to single-precision floating point with a single line like this (excluding the memory management):

vDSP_vfltu8(srcData+0,4,destinationAsFloatRed,1,numberOfPixels);
vDSP_vfltu8(srcData+1,4,destinationAsFloatGreen,1,numberOfPixels);
vDSP_vfltu8(srcData+2,4,destinationAsFloatBlue,1,numberOfPixels);
vDSP_vfltu8(srcData+3,4,destinationAsFloatAlpha,1,numberOfPixels);

If you then need to create an image from the manipulated floating-point data, use vDSP_vfixu8 to go back the other way:

vDSP_vfixu8(destinationAsFloatRed,1,outputData+0,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatGreen,1,outputData+1,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatBlue,1,outputData+2,4,numberOfPixels);
vDSP_vfixu8(destinationAsFloatAlpha,1,outputData+3,4,numberOfPixels);

Obviously you can just process 1 or 2 channels using the above technique.

The documentation is quite complex, but the results are good.
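
For readers who want to see what the strided arguments mean, the conversion can be read as the plain C loop below. This is only a reference sketch of the semantics, not Apple's implementation and without its SIMD speed; the helper name channel_to_float is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: converts every `stride`-th byte of src to float.
   With stride 4 and src pointing at the R byte of the first pixel, this
   mirrors the result of vDSP_vfltu8(src, 4, dst, 1, count). */
void channel_to_float(const unsigned char *src, size_t stride,
                      float *dst, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = (float)src[i * stride];
    }
}
```

Passing srcData+1 instead of srcData+0 selects the green channel in the same way the vDSP calls above do.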

情绪少女 2024-11-24 13:23:52

As always, loads and stores are the most expensive operations.
You could optimize your code in the following fashion:

Load one int (one RGBA pixel).
Keep the required part in a register (temp variable).
Shift the data to the right place in the temp variable.
Repeat until the native processor word is full (4 times for chars on a 32-bit machine).
Store the temp variable to memory.

The code below was typed quickly, just to get the idea across.

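/* source is read here one RGBA pixel (one 32-bit word) at a time */
const unsigned int *src32 = (const unsigned int *)source;
unsigned int *dest32 = (unsigned int *)dest;
unsigned int tmp;

/* dataSize = number of pixels, assumed to be a multiple of 4 */
for(int i=0; i<dataSize; i+=4){
    tmp  = (src32[i] & 0xFF);           /* R of pixel i (little-endian) */
    tmp |= (src32[i+1] & 0xFF) << 8;
    tmp |= (src32[i+2] & 0xFF) << 16;
    tmp |= (src32[i+3] & 0xFF) << 24;

    *dest32++ = tmp;                    /* one store for four R bytes */
}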
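
The steps listed above can also be sketched as a self-contained function. This version is an illustration under stated assumptions, not the answerer's exact code: the name copy_r_packed is made up, the pixel count is assumed to be a multiple of 4, the target is assumed little-endian, and memcpy is used for the word accesses so the sketch does not depend on pointer alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy the R byte of each RGBA pixel, writing four results per
   32-bit store. pixelCount is assumed to be a multiple of 4, and the
   byte order is assumed little-endian (R is the low byte of each word). */
void copy_r_packed(const unsigned char *source, unsigned char *dest,
                   size_t pixelCount)
{
    for (size_t i = 0; i < pixelCount; i += 4) {
        uint32_t px[4];
        memcpy(px, source + i * 4, sizeof px);   /* load four RGBA pixels */

        uint32_t tmp = (px[0] & 0xFF)
                     | (px[1] & 0xFF) << 8
                     | (px[2] & 0xFF) << 16
                     | (px[3] & 0xFF) << 24;

        memcpy(dest + i, &tmp, sizeof tmp);      /* one 4-byte store */
    }
}
```

Compilers typically turn these fixed-size memcpy calls into plain loads and stores, but whether the packing beats the byte-by-byte loop on a given device is something to measure.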

森林迷了鹿 2024-11-24 13:23:52

Depending on the compiled code, you may want to replace the multiplication by 2 with a second loop index (call it j and advance it by 4):

for(int i=0, j=0; i<dataSize; i+=2, j+=4){
    dest[i] = source[j];
    dest[i+1] = source[j+1];
}

Alternatively, you can replace the multiplication with a shift by 1, though most optimizing compilers already do this on their own:

for(int i=0; i<dataSize; i+=2){
    dest[i] = source[i<<1];
    dest[i+1] = source[(i<<1)+1];
}

梦屿孤独相伴 2024-11-24 13:23:52

I'm more of a while guy -- you can convert it to for, I'm sure

i = j = 0;
while (dataSize--) {
    dst[i++] = src[j++]; /* R */
    dst[i++] = src[j++]; /* G */
    j += 2;              /* ignore B and A */
}

As for it being faster, you have to measure.
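
A minimal timing harness along these lines is one way to do that measurement. Everything here is an illustrative sketch: the function names are made up, clock() measures CPU time only coarsely, and an aggressive optimizer may remove repeated copies whose results are never read, so the numbers should be treated as rough.

```c
#include <stddef.h>
#include <time.h>

typedef void (*copy_fn)(const unsigned char *src, unsigned char *dst,
                        size_t pixels);

/* Candidate 1: indexed loop, copying R and G of each RGBA pixel. */
void copy_rg_indexed(const unsigned char *src, unsigned char *dst,
                     size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[i * 2]     = src[i * 4];
        dst[i * 2 + 1] = src[i * 4 + 1];
    }
}

/* Candidate 2: running pointers, in the spirit of the while loop. */
void copy_rg_pointers(const unsigned char *src, unsigned char *dst,
                      size_t pixels)
{
    while (pixels--) {
        *dst++ = src[0];   /* R */
        *dst++ = src[1];   /* G */
        src += 4;          /* skip B and A */
    }
}

/* Returns the CPU seconds spent running fn `repeats` times. */
double time_copy(copy_fn fn, const unsigned char *src, unsigned char *dst,
                 size_t pixels, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        fn(src, dst, pixels);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy once per candidate with the same buffers and a large repeat count, in a release build on the target device, gives a first comparison.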

一绘本一梦想 2024-11-24 13:23:52

The answer from Roger is probably the cleanest solution. It's always good to have a library to keep your code small. But if you only want to optimize C code, you can try different things. First you should analyze how big your dataSize is. You can then do heavy loop unrolling, probably combined with copying ints instead of bytes: (pseudo code)

while(dataSize-i > n) { // n being 10 or whatever
   *(int*)(dest+i) = *(int*)(src+i); i++; // or i+=4; depending what you copy
   *(int*)(dest+i) = *(int*)(src+i);
   ... n times
}

and then do the rest with:

switch(dataSize-i) {
    case n-1: *(dest+i) = *(src+i); i++;
    case n-2: ...
    case 1: ...
}

It gets a bit ugly, but it sure is fast :)

You can optimize even more if you know how dataSize behaves. Maybe it's always a power of 2? Or an even number?

I just realized that you can't copy 4 bytes at once :) but only 2 bytes. Anyway, I just wanted to show you how to end an unrolled loop with a switch statement that needs only 1 comparison. IMO that's the only way to get a decent speedup.
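
To make the pseudocode concrete, here is a small sketch of the same pattern for the R-only copy: an unrolled main loop followed by a fall-through switch for the leftovers. The function name copy_r_unrolled and the unroll factor of 4 are arbitrary choices for illustration, and it copies single bytes rather than ints.

```c
#include <stddef.h>

/* Sketch: copy the R byte of each RGBA pixel, four pixels per loop
   iteration, then finish the 0-3 leftover pixels with one switch. */
void copy_r_unrolled(const unsigned char *src, unsigned char *dest,
                     size_t pixelCount)
{
    size_t i = 0;

    while (pixelCount - i >= 4) {
        dest[i]     = src[i * 4];
        dest[i + 1] = src[i * 4 + 4];
        dest[i + 2] = src[i * 4 + 8];
        dest[i + 3] = src[i * 4 + 12];
        i += 4;
    }

    /* One comparison picks the entry point; each case falls through. */
    switch (pixelCount - i) {
    case 3: dest[i] = src[i * 4]; i++; /* fall through */
    case 2: dest[i] = src[i * 4]; i++; /* fall through */
    case 1: dest[i] = src[i * 4]; i++; /* fall through */
    default: break;
    }
}
```

Modern compilers often unroll simple loops like this on their own at higher optimization levels, so the hand-written version is worth timing against the plain loop before keeping it.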

谁把谁当真 2024-11-24 13:23:52

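Is your question still relevant? A few days ago I published an ASM-accelerated function for copying bytes with a stride. It is about twice as fast as the corresponding C code. You can find it here: https://github.com/noveogroup/ios-aux It can be modified to copy words for the RG byte-copy case.

UPD: I found that my solution is faster than the C code only in debug mode, where compiler optimization is off by default. In release mode the C code is optimized (by default) and is as fast as my ASM code.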

多彩岁月 2024-11-24 13:23:52

Hope I'm not too late to the party! I just accomplished something similar on the iPad using ARM NEON intrinsics. I get a 2-3x speedup compared to the other listed answers. Note that the code below keeps only the first channel and requires the data size (dataSize, in bytes) to be a multiple of 32.

#include <arm_neon.h>

uint32x4_t mask = vdupq_n_u32(0xFF);

for (unsigned int i=0, j=0; i < dataSize; i+=32, j+=8) {

    // Load eight 4-byte integers from the source
    uint32x4_t vec0 = vld1q_u32((const unsigned int *) &source[i]);
    uint32x4_t vec1 = vld1q_u32((const unsigned int *) &source[i+16]);

    // Zero everything but the first byte in each of the eight integers
    vec0 = vandq_u32(vec0, mask);
    vec1 = vandq_u32(vec1, mask);

    // Throw away two bytes for each of the original integers
    uint16x4_t vec0_s = vmovn_u32(vec0);
    uint16x4_t vec1_s = vmovn_u32(vec1);

    // Combine the remaining bytes into a single vector
    uint16x8_t vec01_s = vcombine_u16(vec0_s, vec1_s);

    // Throw away the last byte for each of the original integers
    uint8x8_t vec_o = vmovn_u16(vec01_s);

    // Store to destination
    vst1_u8(&dest[j], vec_o);
}
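
As a different technique from the mask-and-narrow code above, NEON also has de-interleaving loads: vld4q_u8 reads 64 bytes and splits them into four 16-byte vectors, one per channel. The sketch below uses it when NEON is available and falls back to a scalar loop otherwise; the function name copy_r_deinterleave is made up, and no claim is made here about which variant is faster on a given device.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sketch: extract the R channel from RGBA data. pixelCount may be any
   value; leftover pixels are handled by the scalar loop at the end. */
void copy_r_deinterleave(const uint8_t *source, uint8_t *dest,
                         size_t pixelCount)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Each iteration handles 16 pixels: val[0] holds the sixteen R
       bytes, val[1] the G bytes, val[2] the B bytes, val[3] the A bytes. */
    for (; i + 16 <= pixelCount; i += 16) {
        uint8x16x4_t rgba = vld4q_u8(source + i * 4);
        vst1q_u8(dest + i, rgba.val[0]);
    }
#endif

    /* Scalar tail (and the whole job on targets without NEON). */
    for (; i < pixelCount; i++) {
        dest[i] = source[i * 4];
    }
}
```

For the RG case, the same load could be followed by an interleaving store of val[0] and val[1] with vst2q_u8, writing to dest + i * 2.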

青衫儰鉨ミ守葔 2024-11-24 13:23:52

Are you comfortable with ASM? I am not familiar with ARM processors, but on Analog Devices' Blackfin this copy is effectively free, since it can be done in parallel with a compute operation:

i0 = _src_addr;
i1 = _dest_addr;
p0 = dataSize - 1;

r0 = [i0++];
loop _mycopy lc0 = p0;
loop_begin _mycopy;
    /* possibly do compute work here | */ r0 = [i0++] | W [i1++] = r0.l;
loop_end _mycopy;
W [i1++] = r0.l;

So you get 1 cycle per pixel. Note that, as written, this works for an RG or BA copy. As I said, I am not familiar with ARM and know absolutely nothing about iOS, so I am not sure you even have access to ASM code, but you can try looking for that kind of optimization.
