Optimizing a lookup table to a simple ALU
Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common that in NUMA locks you will need to do cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can, I am hoping to generically reach a solution that will be able to cover most layouts.
Looking at what modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
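The exact source of case0/case1 is not shown in the question; a plausible switch-based rewrite (an assumption, chosen only to match the labels in the assembly below) would be:
uint32_t
case0(uint32_t r) {
    /* Same mapping as lookup0's table { 0, 1, 2, 3 }. */
    switch(r) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    }
    __builtin_unreachable();
}
uint32_t
case1(uint32_t r) {
    /* Same mapping as lookup1's table { 0, 0, 1, 1 }. */
    switch(r) {
    case 0: return 0;
    case 1: return 0;
    case 2: return 1;
    case 3: return 1;
    }
    __builtin_unreachable();
}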
lookup0(unsigned int): # @lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # @case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # @lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # @case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumptions are:
- All inputs have an index in the LUT.
- All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
- (Edit4) Would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and there would be some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 functions are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology, so ~256-1024 is where I would expect the upper bound in size, but for a generic LUT it could obviously get much larger.
Comments (2)
The best "generic" solution I am aware of is the following:
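A minimal sketch of the kind of loop meant here, selecting each entry with a compare-and-multiply (the name lookup1_loop and the exact form are assumptions, not the answer's original code):
#include <stdint.h>
#include <stddef.h>

uint32_t
lookup1_loop(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    uint32_t result = 0;
    /* Accumulate tbl[i] multiplied by the 0/1 result of the comparison,
       so only the matching entry contributes and there is no branch. */
    for(size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); ++i)
        result += (r == i) * tbl[i];
    return result;
}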
At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code in which each table value is multiplied by the result of a comparison. The GCC/Clang optimizers know that this multiplication can be replaced with conditional moves (developers often use it as a trick to guide compilers into generating assembly code without conditional branches).
The resulting assembly for Clang (and likewise for GCC) has no branches and no memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction. A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big the loop will not be automatically unrolled. You can tell the compiler to unroll more aggressively via compilation flags. That being said, a LUT will likely be faster when the table is big, since loading and executing a huge amount of code is slow in this pathological case.
You could pack the array into a long integer and use bitshifts and anding to extract the result.
For example, the table {2,0,3,1} could be handled with two bits per entry, as in the sketch below. It produces relatively nice assembly: not perfect, but branchless and with no indirection.
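A minimal sketch of that packing, assuming two bits per entry (the name lookup_packed is illustrative, not the answer's original snippet):
#include <stdint.h>

uint32_t
lookup_packed(uint32_t r) {
    /* Table { 2, 0, 3, 1 } packed at two bits per entry:
       entry i occupies bits [2*i, 2*i + 1], so packed == 0x72. */
    const uint32_t packed = (2u << 0) | (0u << 2) | (3u << 4) | (1u << 6);
    /* Shift the wanted entry down and mask off the rest. */
    return (packed >> (r * 2)) & 3u;
}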
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks to allow handling larger arrays, like using longer integers (i.e. uint64_t or the __uint128_t extension). Another approach is to split the bits of the values in the array, e.g. into a high and a low byte, look each part up separately, and combine the results using bitwise operations.
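A sketch of the splitting idea (not the answer's code; to keep the constants short it splits each value into 4-bit halves rather than bytes, and the 8-entry table of byte-sized values is made up):
#include <stdint.h>

uint32_t
lookup_split(uint32_t r) {
    /* Hypothetical table { 0x12, 0x34, 0x05, 0xA7, 0x60, 0x1F, 0x08, 0x9C },
       split into two packed words of 4-bit pieces, entry i at bits [4*i, 4*i + 3]. */
    const uint32_t lo_parts = 0xC8F07542u; /* low 4 bits of each entry  */
    const uint32_t hi_parts = 0x9016A031u; /* high 4 bits of each entry */
    uint32_t lo = (lo_parts >> (r * 4)) & 0xFu;
    uint32_t hi = (hi_parts >> (r * 4)) & 0xFu;
    /* Recombine the two partial lookups into the original 8-bit value. */
    return (hi << 4) | lo;
}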