Loading 8-bit uint8_t values as uint32_t?

Posted 2024-09-18 13:08:11

My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.

I have a grayscale image (consider the example below), and in my algorithm I only have to sum the columns.

How can I load four 8-bit pixel values (uint8_t) in parallel as four uint32_t into one of the 128-bit NEON registers? Which intrinsic do I have to use to do this?

I mean:

[image]

I must load them as 32 bits because, if you look carefully, the moment I do 255 + 255 the result is 510, which can't be held in an 8-bit register.

e.g.

255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels)
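
The overflow described in the question can be illustrated with a minimal scalar sketch. The helper names add_u8 and add_u32 are hypothetical and exist only for this illustration; the point is just that an 8-bit sum wraps modulo 256, while a widened sum keeps the full value.

```c
#include <stdint.h>

/* Add two pixels and keep only 8 bits: the result wraps modulo 256. */
static uint8_t add_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Widen to 32 bits before adding: the full sum is preserved. */
static uint32_t add_u32(uint8_t a, uint8_t b)
{
    return (uint32_t)a + (uint32_t)b;
}
```

With two white pixels, add_u8(255, 255) gives 254 and add_u32(255, 255) gives 510. A full column of 480 white pixels sums to 480 * 255 = 122400, which is why a wider accumulator is needed.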

烦人精 2024-09-25 13:08:11

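I would recommend that you spend some time understanding how SIMD works on ARM. Take a look at:

[links not preserved]

to get you started. You can then implement your SIMD code using inline assembler or the corresponding ARM intrinsics recommended by domen.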


地狱即天堂 2024-09-25 13:08:11

It depends on your compiler and its (possibly missing) extensions.

For GCC, for example, this might be a starting point: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html


水水月牙 2024-09-25 13:08:11

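If you need to sum 480 8-bit values, then you technically need 17 bits of intermediate storage (480 * 255 = 122400, which exceeds 65535). However, if you perform the additions in two stages, i.e. the top 240 rows and then the bottom 240 rows, each stage fits in 16 bits (240 * 255 = 61200). You can then add the results of the two halves to get the final answer.

There is actually a NEON instruction suited to your algorithm, called vaddw. It adds a doubleword (D) vector to a quadword (Q) vector, where the Q vector's elements are twice as wide as the D vector's. In your case, vaddw.u8 can be used to add 8 pixels to 8 16-bit accumulators. Then vaddw.u16 can be used to add the two sets of 8 16-bit accumulators into one set of 8 32-bit accumulators. Note that you must use the instruction twice per set, once for each half of the 16-bit accumulators.

If necessary, you can also use vmovn or vqmovn to narrow the values back to 16 or 8 bits.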
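
The two-stage scheme above could be sketched with intrinsics roughly as follows. This is only an illustration under stated assumptions: the function name column_sums and the constant ROWS_PER_CHUNK are hypothetical, the image is assumed to be tightly packed with a width that is a multiple of 8, and a scalar fallback is used when the compiler does not target NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical chunk size: 240 rows * 255 = 61200 fits in 16 bits. */
#define ROWS_PER_CHUNK 240

/* Sum every column of a width x height 8-bit image into sums[0..width-1].
 * Assumption: width is a multiple of 8 and rows are stored contiguously. */
static void column_sums(const uint8_t *img, size_t width, size_t height,
                        uint32_t *sums)
{
    for (size_t x = 0; x < width; x += 8) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint32x4_t lo = vdupq_n_u32(0);  /* 32-bit totals, columns x..x+3   */
        uint32x4_t hi = vdupq_n_u32(0);  /* 32-bit totals, columns x+4..x+7 */
        for (size_t y0 = 0; y0 < height; y0 += ROWS_PER_CHUNK) {
            size_t y1 = (height - y0 < ROWS_PER_CHUNK) ? height
                                                       : y0 + ROWS_PER_CHUNK;
            uint16x8_t acc = vdupq_n_u16(0);  /* 16-bit sums for this chunk */
            for (size_t y = y0; y < y1; ++y) {
                /* vaddw.u8: add 8 pixels to 8 16-bit accumulators. */
                acc = vaddw_u8(acc, vld1_u8(img + y * width + x));
            }
            /* vaddw.u16, once per half of the 16-bit accumulators. */
            lo = vaddw_u16(lo, vget_low_u16(acc));
            hi = vaddw_u16(hi, vget_high_u16(acc));
        }
        vst1q_u32(sums + x, lo);
        vst1q_u32(sums + x + 4, hi);
#else
        /* Scalar fallback for targets without NEON. */
        for (size_t i = 0; i < 8; ++i) {
            uint32_t s = 0;
            for (size_t y = 0; y < height; ++y)
                s += img[y * width + x + i];
            sums[x + i] = s;
        }
#endif
    }
}
```

For a 640 x 480 image this processes 80 groups of 8 columns, each in two 240-row chunks. Walking down one column group at a time keeps the sketch short; iterating row by row with a buffer of accumulators would probably be friendlier to the cache.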


傲鸠 2024-09-25 13:08:11

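There is no single instruction that loads 4 8-bit values into 4 32-bit lanes.

You have to load them and then widen twice (for example with vmovl; plain vshl does not widen elements).
Because NEON has no 32-bit vector registers (the smallest, a D register, is 64 bits wide), you have to work on 8 pixels at a time rather than 4,

and then you can only use 16-bit lanes. That should be enough...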
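
As a rough illustration of the 8-pixels-at-a-time idea, the sketch below loads 8 pixels into a D register and widens them once into the 8 16-bit lanes of a Q register. The helper name widen8_u8_to_u16 is hypothetical, and the scalar branch is only there so the example also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Widen 8 consecutive 8-bit pixels into 8 16-bit values. */
static void widen8_u8_to_u16(const uint8_t *src, uint16_t dst[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  v8  = vld1_u8(src);   /* 8 pixels in one D register         */
    uint16x8_t v16 = vmovl_u8(v8);   /* zero-extend to 8 x 16-bit (Q reg.) */
    vst1q_u16(dst, v16);
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];
#endif
}
```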


眼泪都笑了 2024-09-25 13:08:11

Load the 4 bytes using a single-lane load instruction (vld1.32 {<register>[<lane>]}, [<address>]) into one lane of a d-register (the low half of a q-register), then use two move-long instructions (vmovl) to promote them first to 16 and then to 32 bits. The result should be something like (in GNU syntax)

vld1.32 {d0[0]}, [<address>] @ Now d0 = (*<addr>, *<addr+1>, *<addr+2>, *<addr+3>, <junk>, ... <junk>)
vmovl.u8 q0, d0 @ Now q0 = (d0, d1) = ((uint16_t)*<addr>, ... (uint16_t)*<addr+3>, <junk>, ... <junk>)
vmovl.u16 q0, d0 @ Now q0 = (d0, d1) = ((uint32_t)*<addr>, ... (uint32_t)*<addr+3>)

If you can guarantee that <address> is 4-byte aligned, write [<address>:32] instead in the load instruction to save a cycle or two. If you do that and the address isn't aligned, however, you'll get a fault.

Um, I just realized you want to use intrinsics, not assembly, so here's the same thing with intrinsics.

#include <arm_neon.h>

uint32x2_t v8 = vdup_n_u32(0); // Lane 0 will actually hold 4 uint8_t
v8 = vld1_lane_u32((const uint32_t *)ptr, v8, 0); // ptr points at the 4 pixels
const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
const uint32x4_t v32 = vmovl_u16(v16);
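
For reference, the same widening can be wrapped in a small function that compiles on its own. This is a variation, not the code from the answer: the helper name widen4_u8_to_u32 is hypothetical, the 4 bytes are fetched with memcpy and duplicated into a D register instead of using a lane load (to avoid casting a uint8_t pointer to uint32_t), a little-endian target is assumed, and a scalar fallback covers builds without NEON.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Widen 4 consecutive 8-bit pixels into 4 32-bit values.
 * Assumption: little-endian byte order on the NEON path. */
static void widen4_u8_to_u32(const uint8_t *src, uint32_t dst[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);          /* alignment-safe load   */
    uint8x8_t  v8  = vreinterpret_u8_u32(vdup_n_u32(packed));
    uint16x4_t v16 = vget_low_u16(vmovl_u8(v8));  /* 8 -> 16 bits, keep 4  */
    uint32x4_t v32 = vmovl_u16(v16);              /* 16 -> 32 bits         */
    vst1q_u32(dst, v32);
#else
    /* Scalar fallback for targets without NEON. */
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
#endif
}
```

Calling it with the bytes {255, 0, 128, 7} should fill dst with the same four values as 32-bit integers, which can then be accumulated with ordinary 32-bit vector adds.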
