Where can I learn about optimal math operations in C on microcontrollers (e.g. ARMv7)?

Posted 2024-10-26 19:54:34 · 471 words · 1 view · 0 comments

I'm trying to optimize some functions and I realized that I know next to nothing about how long certain things take.

I can ask all the questions here, but I'd rather just find a good article on the subject if anyone knows one.

I'm using IAR to write a program in C for an ATMEL SAM7S processor. I have a sort function that takes 500 us or so, and I wanted to see if I could speed it up. I could also just post it here, but I was hoping to learn for myself.

Like, is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers? And how long does an operation like that take? Just one cycle or more? How long does multiplication take compared to subtraction?

Anyone know a place to look? I tried googling for some stuff but I couldn't come up with any useful search terms.

If anyone has any ideas on my specific function, I can post details. I'm basically trying to match two analog values to the closest index in a table of calibrated values. Right now I iterate through the whole table and use least squares to determine the closest match. It's pretty straightforward, and I'm not sure there is a faster way without applying some extra logic to my table. But if I at least knew how long certain things took, I could probably optimize it myself.

Comments (3)

迷离° 2024-11-02 19:54:35

A good first step would be to study the assembly language of the architecture you are coding for.

After that, you should be able to read the code generated by your compiler and compare what the processor actually has to do for two different implementations.

难以启齿的温柔 2024-11-02 19:54:35

You can use the timers in your SAM7S. Read a timer at the start, read it again after N searches, and subtract to get the difference. Try different algorithms and see what you get.
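A minimal sketch of that measurement, assuming a free-running 16 bit counter. The `read_timer_fn` hook is hypothetical and stands in for however you read the timer's counter register on your part; the 16 bit width is also an assumption.

```c
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running 16-bit
 * timer counter. On real hardware this would read the counter register. */
typedef uint16_t (*read_timer_fn)(void);

/* Ticks elapsed between two readings of a free-running 16-bit counter.
 * Unsigned subtraction modulo 2^16 stays correct across one wrap-around,
 * as long as the measured interval is shorter than 65536 ticks. */
uint16_t elapsed_ticks(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Time n calls of the function under test and return the total ticks. */
uint16_t time_n_calls(read_timer_fn read_timer, void (*fn)(void), unsigned n)
{
    uint16_t start = read_timer();
    for (unsigned i = 0; i < n; i++) {
        fn();
    }
    return elapsed_ticks(start, read_timer());
}
```

Divide the result by n to get ticks per call. If the total can exceed one counter period, pick a slower timer clock or a smaller n.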

As far as 16 bit math vs 32 bit math goes, yes, there can be a huge difference, but you have to look at your architecture. A subtract operation between two registers takes the same one clock, be it 16 bit or 32 bit. But coming from C code, the variables may eventually land in memory, and you have to know whether you have a 16 bit or 32 bit data bus (yes, ARM7s can have a 16 bit bus; look at the GameBoy Advance, where thumb code runs significantly faster than ARM code). It takes twice as many cycles to read or write 32 bit numbers on a 16 bit bus. You likely do NOT have a 16 bit bus though. Using 16 bit variables on a 32 bit processor forces the compiler to add extra instructions to strip or extend the upper bits so that the math is correct for a 16 bit variable. Those extra instructions can cost performance: a simple subtract which might have been, say, 3 or 4 instructions worst case might now be 5 or 6, and that is noticeable in a tight loop. Generally you want to use variables that match the processor's register size. On a 32 bit ARM, use 32 bit variables as much as possible, even if you are only counting to 10.
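To see the effect for yourself, something like the pair below can be compiled and the generated assembly compared. This is only an illustrative sketch; the function names are made up, and how many extra instructions (if any) the 16 bit version costs depends on your compiler and optimization level.

```c
#include <stdint.h>

/* 16-bit counter and accumulator: on a 32-bit core the compiler may need
 * extra instructions to truncate intermediate results back to 16 bits. */
uint16_t sum_u16(const uint8_t *data, uint16_t len)
{
    uint16_t sum = 0;
    for (uint16_t i = 0; i < len; i++) {
        sum = (uint16_t)(sum + data[i]);
    }
    return sum;
}

/* Register-sized counter and accumulator: no narrowing inside the loop. */
uint32_t sum_u32(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}
```

If the code also has to run on smaller parts, the `uint_fast16_t` family from `<stdint.h>` lets the compiler pick a width that is fast on each target.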

Hopefully I am understanding the problem you are trying to solve here, if not let me know and I will edit/remove this response:

Depending on how many bits are in your measurement, the typical solution for what you are doing is to use a look up table. So that I can show an example, let's say you are taking a 4 bit measurement that you want to calibrate. Call it 0 to 15. Calibration of the sensor generated a list of data points, let's say:

raw   cal
0x03  16
0x08  31
0x0E  49

I assume what you are doing at run time is something like this: if the sensor reads 0x5, you look through the list for an entry that matches the reading, or for the two cal points it falls between.

Searching, you will find it to be between 0x03 and 0x08. To get the calibrated result from the raw 0x05 measurement:

cal = (((0x05-0x03)/(0x08-0x03))*(31-16))+16 = 22

You have a divide in there, which is a HUGE performance killer on most processors, ARM7 in particular as it doesn't have a divide instruction. Not sure about the multiply, but you want to avoid those like the plague as well. And think about how many instructions all of that takes.
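One detail worth spelling out: written in integer C exactly as above, (0x05-0x03)/(0x08-0x03) truncates to 0 and the result comes out as 16, not 22. A sketch of the run-time math that multiplies before dividing is below. The function name and the preconditions in the comment are my assumptions, and it still contains the expensive divide.

```c
#include <stdint.h>

/* Linear interpolation between (xlo, ylo) and (xhi, yhi) in integer math.
 * Multiplying before dividing avoids the truncation to 0 that
 * (raw - xlo) / (xhi - xlo) gives for any raw below xhi.
 * Assumes xlo <= raw, xlo < xhi, ylo <= yhi, and that the product
 * (raw - xlo) * (yhi - ylo) fits in 32 bits. */
uint32_t interp(uint32_t raw, uint32_t xlo, uint32_t xhi,
                uint32_t ylo, uint32_t yhi)
{
    return ylo + ((raw - xlo) * (yhi - ylo)) / (xhi - xlo);
}
```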

Instead what you do is take the algorithm you are using run-time, and in an ad-hoc program generate all the possible outputs from all the possible inputs:

0  7
1  10
2  13
3  16
4  19
5  22
6  25
7  28
8  31
9  34
10 37
11 40
12 43
13 46
14 49
15 52
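The ad-hoc generator can be as small as the sketch below. It takes the three calibration points as raw 3, 8 and 14, and extrapolates from the nearest segment for inputs outside that range; that extrapolation is my guess at how the values for 0 to 2 and 15 were produced. All names are illustrative.

```c
#include <stdint.h>

#define N_POINTS  3
#define TABLE_LEN 16

/* Calibration points from the example, raw values written in decimal. */
static const int cal_raw[N_POINTS] = {3, 8, 14};
static const int cal_val[N_POINTS] = {16, 31, 49};

/* Fill out[0..15] with the calibrated value for every possible 4-bit input.
 * Inputs outside the calibrated range are extrapolated from the nearest
 * segment (an assumption). Signed math is used so raw < xlo works. */
void build_cal_table(uint8_t out[TABLE_LEN])
{
    for (int raw = 0; raw < TABLE_LEN; raw++) {
        int seg = 0;
        /* advance to the last segment whose lower point is <= raw */
        while (seg < N_POINTS - 2 && raw >= cal_raw[seg + 1]) {
            seg++;
        }
        int xlo = cal_raw[seg], xhi = cal_raw[seg + 1];
        int ylo = cal_val[seg], yhi = cal_val[seg + 1];
        out[raw] = (uint8_t)(ylo + (raw - xlo) * (yhi - ylo) / (xhi - xlo));
    }
}
```

This runs on your PC, not on the target, so the divide here costs nothing at run time. Print the array as a C initializer and paste it into the firmware.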

Now turn that into a table in your run-time code:

unsigned char cal_table[16]={7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52};

and then at run time:

cal = cal_table[raw&15];

The code to implement this looks something like:

ldr r3, =cal_table
and r0, r0, #15
ldrb    r0, [r3, r0]

which takes about 5 clocks to execute.

Just the math to find cal from raw after you have searched through the table:

cal=  (((raw-xlo)/(xhi-xlo))*(yhi-ylo)+ylo);

looks something like this:

docal:
    stmfd   sp!, {r3, r4, r5, lr}
    ldr r3, .L2
    ldr r5, .L2+4
    ldr lr, .L2+8
    ldr ip, [r5, #0]
    ldr r0, [r3, #0]
    ldr r1, [lr, #0]
    ldr r2, .L2+12
    rsb r0, ip, r0
    rsb r1, ip, r1
    ldr r5, [r2, #0]
    bl  __aeabi_uidiv
    ldr r4, .L2+16
    ldr r3, .L2+20
    ldr r4, [r4, #0]
    rsb r5, r4, r5
    mla r4, r0, r5, r4
    str r4, [r3, #0]
    ldmfd   sp!, {r3, r4, r5, pc}

And the divide function is as bad if not worse. The look up table should make your code run dozens of times faster.

The problem with look up tables is that you trade memory for performance, so you need a table big enough to cover all the possible inputs. A 12 bit sensor would give you as many as 4096 entries in the look up table, for example. If, say, you knew the measurement would never be below 0x100, you could make the table 0x1000 - 0x100 or 3840 entries deep and subtract 0x100 from the raw value before looking it up, trading a couple of instructions at run time to save a few hundred bytes of memory.
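A sketch of the offset table, with a clamp added so a stray reading cannot index outside the array. The clamp and all names are my additions, and the table is left zero-filled here where a real one would be a generated const array.

```c
#include <stdint.h>

#define RAW_MIN   0x100u                    /* lowest reading expected */
#define RAW_MAX   0xFFFu                    /* highest 12-bit reading */
#define TABLE_LEN (RAW_MAX - RAW_MIN + 1u)  /* 3840 entries */

/* Placeholder: normally a const table generated offline. */
uint16_t cal_table_12[TABLE_LEN];

/* Map a raw reading to a table index, clamping out-of-range inputs. */
uint32_t cal_index(uint32_t raw)
{
    if (raw < RAW_MIN) raw = RAW_MIN;
    if (raw > RAW_MAX) raw = RAW_MAX;
    return raw - RAW_MIN;
}

/* Run-time calibration: two compares, one subtract, one load. */
uint16_t calibrate(uint32_t raw)
{
    return cal_table_12[cal_index(raw)];
}
```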

If the table would be too big, you could try some other tricks, like making a look up table of the upper bits whose output is a pre-computed offset into the cal table at which to start your search. So if you had a 12 bit ADC but didn't have room for a 4096 entry look up table, you could make a 16 entry look up table, take the upper 4 bits of the ADC output and use them to index it. The table would contain the entry in the cal table at which to start searching. Say your cal table had these entries:

....
entry 27 raw = 0x598 cal = 1005
entry 28 raw = 0x634 cal = 1600
entry 29 raw = 0x6AB cal = 1800
entry 30 raw = 0x777 cal = 2000

Your 16 deep look up table would then have these entries:

...
[6] = 27;
[7] = 29;
...

And how you would use it is

start = lut[raw>>8];
for(i=start;i<cal_tab_len;i++)
{
...
}

instead of

for(i=0;i<cal_tab_len;i++)
{
}

It could potentially greatly shorten the time it takes to find the entry in the table for you to perform the math needed.
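Here is a fuller sketch of the idea with a small made-up cal table (six entries, sorted by raw value). The raw values of the last four entries are borrowed from the example above, but the indices differ and everything else is hypothetical.

```c
#include <stdint.h>

#define CAL_LEN 6

/* Hypothetical calibration table, sorted by raw value (12-bit ADC). */
static const uint16_t cal_raw[CAL_LEN] = {
    0x010, 0x2A0, 0x598, 0x634, 0x6AB, 0x777
};

uint8_t lut[16];

/* For each value of the upper 4 bits, store the index of the segment that
 * contains the smallest reading in that bucket, clamped to the last
 * valid segment (CAL_LEN - 2). Run once at start-up or offline. */
void build_lut(void)
{
    for (unsigned hi = 0; hi < 16; hi++) {
        unsigned lowest = hi << 8;
        unsigned idx = 0;
        while (idx + 2 < CAL_LEN && cal_raw[idx + 1] <= lowest) {
            idx++;
        }
        lut[hi] = (uint8_t)idx;
    }
}

/* Return i such that raw lies in the segment cal_raw[i]..cal_raw[i+1].
 * Readings outside the table clamp to the first or last segment. */
unsigned find_segment(unsigned raw)
{
    unsigned i = lut[(raw >> 8) & 15];
    while (i + 2 < CAL_LEN && cal_raw[i + 1] <= raw) {
        i++;
    }
    return i;
}
```

With this table, lut[6] is 2 and lut[7] is 4, which mirrors the [6] = 27 and [7] = 29 entries above. The search then only walks the few entries inside one bucket.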

For the particular problem of taking a raw value and turning it into a calibrated value at run time, there are many, many similar shortcuts. I don't know of one book that would cover them all. Which path to take has a lot to do with your processor, your memory system and how much of it is available, and the size and nature of your data. You generally want to avoid divides in particular, and sometimes multiplies, if your processor does not support them in very few clock cycles. Most processors do not. (Yes, the one or two processors most programmers target do have a single cycle multiply and divide.) Even on processors that have a single cycle multiply and divide, the operation often has to be wrapped in C library code that decides whether it is safe to use the hardware instruction or whether the result has to be synthesized in software. I mentioned above that for most variables you want to match the native register size of the processor. If you have fixed point multiplies or divides, you will often want to use half the register size of the processor. On a 32 bit processor, unless you take the time to examine the instructions in detail, you probably want to limit your multiplies to 16 bit inputs with a 32 bit output, and your divides to 32 bit inputs with a 16 bit output, and hope the optimizer helps you out.
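As one possible instance of the 16 bit in, 32 bit out rule, the slope of each calibration segment can be precomputed as a fixed point number so the run-time path is a multiply and a shift. The Q8 scaling (slope times 256) and the names are my assumptions; the sketch assumes the slope is below 256 so it fits in 16 bits, and it accepts a small rounding error when the slope is not exact.

```c
#include <stdint.h>

/* Slope (yhi - ylo) / (xhi - xlo) scaled by 256 (Q8 fixed point).
 * Meant to run offline or at calibration time, so this divide is not in
 * the hot path. Assumes xlo < xhi, ylo <= yhi, and a slope below 256. */
uint16_t slope_q8(uint16_t xlo, uint16_t xhi, uint16_t ylo, uint16_t yhi)
{
    uint32_t dy = (uint32_t)(yhi - ylo);
    uint32_t dx = (uint32_t)(xhi - xlo);
    return (uint16_t)((dy << 8) / dx);
}

/* Run-time interpolation: one 16 x 16 -> 32 bit multiply and a shift.
 * Assumes raw >= xlo. */
uint16_t interp_q8(uint16_t raw, uint16_t xlo, uint16_t ylo, uint16_t slope)
{
    uint32_t prod = (uint32_t)(uint16_t)(raw - xlo) * (uint32_t)slope;
    return (uint16_t)(ylo + (prod >> 8));
}
```

For the 4 bit example, the segment from (3, 16) to (8, 31) has a Q8 slope of 768, and a raw reading of 5 gives 16 + ((2 * 768) >> 8) = 22.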

Again, if I have assumed incorrectly what problem you were trying to solve, please comment and I will edit/modify this response.

回心转意 2024-11-02 19:54:34

is it any faster to subtract two 16 bit integers than it is to subtract two 32 bit integers?

Not on an ARM architecture which has native 32-bit registers, no.

Anyone know a place to look?

The canonical place for instruction cycle timings would be the Tech Ref Manual for the particular architecture your chip implements, e.g. ARM7TDMI. The timings for simple ALU ops are listed there, and yes, it is one cycle. This is not a friendly doc to read if you're not already familiar with the instruction set, though...

Right now I iterate through the whole table

You'll be much better off looking at algorithmic optimisations here (eg indexing the table, sorting by one co-ordinate to narrow it down, etc) than worrying about instruction-level micro-optimisations.
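For example, if the table is sorted by one co-ordinate, the least squares scan can stop early. This is only a sketch under my own assumptions: a hypothetical `cal_point` layout, a table sorted ascending by x, and values small enough (e.g. 12 bit ADC readings) that the squared distance fits in 32 bits.

```c
#include <stdint.h>

typedef struct {
    int32_t x;
    int32_t y;
} cal_point;

/* Return the index of the table entry closest to (x, y) by squared
 * distance, or -1 for an empty table. The table must be sorted by x. */
int nearest_sorted(const cal_point *tab, int len, int32_t x, int32_t y)
{
    int best = -1;
    int32_t best_d2 = INT32_MAX;
    for (int i = 0; i < len; i++) {
        int32_t dx = tab[i].x - x;
        int32_t dx2 = dx * dx;
        /* Past the target, dx only grows: once dx alone is no better
         * than the best match so far, no later entry can win. */
        if (dx > 0 && dx2 >= best_d2) {
            break;
        }
        int32_t dy = tab[i].y - y;
        int32_t d2 = dx2 + dy * dy;
        if (d2 < best_d2) {
            best_d2 = d2;
            best = i;
        }
    }
    return best;
}
```

This only prunes the tail of the table. Starting the scan near the target (with a binary search on x, or an index table on the upper bits) would prune the head as well.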
