Load 8-bit uint8_t as uint32_t?

Posted 2024-09-18 13:08:11 · 534 characters · 17 views · 0 comments

My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.

I have a grayscale image (consider the example below), and my algorithm only has to add up the columns.

How can I load four 8-bit pixel values (uint8_t) in parallel as four uint32_t values into one of the 128-bit NEON registers? Which intrinsic do I have to use to do this?

I mean:

(image not available)

I must load them as 32-bit values because, if you look carefully, as soon as I compute 255 + 255 the result is 510, which cannot be held in an 8-bit register.

e.g.

255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels) 


Comments (5)

地狱即天堂 2024-09-25 13:08:11

That depends on your compiler and its (possible lack of) extensions.

For example, with GCC this might be a starting point: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

水水月牙 2024-09-25 13:08:11

If you need to sum up to 480 8-bit values, you technically need 17 bits of intermediate storage (480 × 255 = 122400, which exceeds 65535). However, if you perform the additions in two stages, i.e. the top 240 rows and then the bottom 240 rows, each stage fits in 16 bits (240 × 255 = 61200). You can then add the results from the two halves to get the final answer.

There is actually a NEON instruction that suits your algorithm: vaddw. It adds a dword (64-bit) vector to a qword (128-bit) vector, where the elements of the qword vector are twice as wide as those of the dword vector. In your case, vaddw.u8 can be used to add 8 pixels to 8 16-bit accumulators. Then vaddw.u16 can be used to add the two sets of 8 16-bit accumulators into one set of 8 32-bit accumulators. Note that each vaddw.u16 handles only 4 elements, so you must use the instruction twice per set to cover both halves.

If necessary, you can also convert the values back to 16-bit or 8-bit by using vmovn or vqmovn.
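The two-stage accumulation described above can be sketched with intrinsics roughly as follows. This is only an illustration under assumptions: the helper name column_sums8, the stride parameter, and the 8-column strip layout are hypothetical and not from the original post, and the scalar branch exists only so the sketch also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums each of 8 adjacent columns over `rows` rows.
 * img points at the first pixel of the 8-column strip,
 * stride is the distance between rows in bytes. */
static void column_sums8(const uint8_t *img, size_t stride, size_t rows,
                         uint32_t out[8])
{
    assert(img != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t lo32 = vdupq_n_u32(0);   /* columns 0..3 */
    uint32x4_t hi32 = vdupq_n_u32(0);   /* columns 4..7 */
    size_t r = 0;
    while (r < rows) {
        /* 240 rows of 255 sum to 61200, which still fits in 16 bits. */
        size_t chunk = rows - r;
        if (chunk > 240) chunk = 240;
        uint16x8_t acc16 = vdupq_n_u16(0);
        for (size_t i = 0; i < chunk; ++i) {
            /* vaddw.u8: add 8 pixels to 8 16-bit accumulators. */
            acc16 = vaddw_u8(acc16, vld1_u8(img + (r + i) * stride));
        }
        /* vaddw.u16, used twice: one call per half of the 16-bit vector. */
        lo32 = vaddw_u16(lo32, vget_low_u16(acc16));
        hi32 = vaddw_u16(hi32, vget_high_u16(acc16));
        r += chunk;
    }
    vst1q_u32(out, lo32);
    vst1q_u32(out + 4, hi32);
#else
    /* Scalar fallback with the same result. */
    for (int c = 0; c < 8; ++c) out[c] = 0;
    for (size_t r = 0; r < rows; ++r)
        for (int c = 0; c < 8; ++c)
            out[c] += img[r * stride + c];
#endif
}
```

For a 640-pixel-wide image one would presumably call this once per 8-column strip, i.e. 80 times, with stride set to the row pitch.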

傲鸠 2024-09-25 13:08:11

There is no instruction that can load your four 8-bit values directly into four 32-bit lanes.

You must load them and then widen them twice (vshll, or equivalently vmovl).
Because NEON loads at least a 64-bit register here, it is simpler to work on 8 pixels at a time (not 4).

You could also use 16-bit lanes only. That should be enough...
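Working on 8 pixels at a time, as suggested here, might look like the sketch below. The helper name widen8_u8_to_u32 is hypothetical, and the scalar branch is only there so the example builds without NEON; it is one possible reading of this answer, not a definitive implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widens 8 consecutive pixels to 8 uint32_t values. */
static void widen8_u8_to_u32(const uint8_t *src, uint32_t dst[8])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  p8  = vld1_u8(src);     /* load 8 pixels into a d-register */
    uint16x8_t p16 = vmovl_u8(p8);     /* first widening: 8 x 16 bits */
    /* Second widening, once per half: 4 x 32 bits each. */
    vst1q_u32(dst,     vmovl_u16(vget_low_u16(p16)));
    vst1q_u32(dst + 4, vmovl_u16(vget_high_u16(p16)));
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 8; ++i) dst[i] = src[i];
#endif
}
```

If 16-bit lanes are enough for the computation, the second widening step can simply be skipped and p16 used directly.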

眼泪都笑了 2024-09-25 13:08:11

Load the 4 bytes with a single-lane load instruction (vld1.32 {<register>[<lane>]}, [<address>]) into a d-register, then use two move-long instructions (vmovl) to promote them first to 16 and then to 32 bits. The result should be something like this (in GNU syntax):

vld1.32 {d0[0]}, [<address>] @ Now d0 = (*<addr>, *<addr+1>, *<addr+2>, *<addr+3>, <junk>, ... <junk>)
vmovl.u8 q0, d0 @ Now q0 = (d0, d1) = ((uint16_t)*<addr>, ... (uint16_t)*<addr+3>, <junk>, ... <junk>)
vmovl.u16 q0, d0 @ Now q0 = (d0, d1) = ((uint32_t)*<addr>, ... (uint32_t)*<addr+3>)

If you can guarantee that <address> is 4-byte aligned, write [<address>:32] in the load instruction instead to save a cycle or two. If you do that and the address is not aligned, however, you will get a fault.

Um, I just realized you want to use intrinsics, not assembly, so here is the same thing with intrinsics.

uint32x2_t v8 = vdup_n_u32(0); // Lane 0 will actually hold 4 uint8_t
v8 = vld1_lane_u32((const uint32_t *)ptr, v8, 0); // ptr points at the 4 pixels
const uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
const uint32x4_t v32 = vmovl_u16(v16);
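To connect this back to the column sums in the question, the 4-pixel widening could be wrapped in a loop that accumulates directly in 32-bit lanes. The following is a sketch under assumptions: the helper name column_sums4 and the stride parameter are hypothetical, a little-endian target is assumed, and memcpy is used for the 4-byte read so that no alignment of the pixel pointer is required.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums each of 4 adjacent columns over `rows` rows
 * using four 32-bit accumulators. Assumes a little-endian target. */
static void column_sums4(const uint8_t *img, size_t stride, size_t rows,
                         uint32_t out[4])
{
    assert(img != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t r = 0; r < rows; ++r) {
        uint32_t word;
        memcpy(&word, img + r * stride, sizeof word);  /* unaligned-safe read */
        /* Lane 0 holds the 4 pixels, lane 1 is zero. */
        uint32x2_t v8  = vld1_lane_u32(&word, vdup_n_u32(0), 0);
        uint16x4_t v16 = vget_low_u16(vmovl_u8(vreinterpret_u8_u32(v8)));
        acc = vaddq_u32(acc, vmovl_u16(v16));          /* 4 x uint32_t add */
    }
    vst1q_u32(out, acc);
#else
    /* Scalar fallback with the same result. */
    for (int c = 0; c < 4; ++c) out[c] = 0;
    for (size_t r = 0; r < rows; ++r)
        for (int c = 0; c < 4; ++c)
            out[c] += img[r * stride + c];
#endif
}
```

With 32-bit accumulators there is no need to split the 480 rows into two halves, at the cost of handling only 4 columns per iteration.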