Optimizing a lookup table to a simple ALU
Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common that in NUMA locks you will need to do cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can, I am hoping to generically reach a solution that will be able to cover most layouts.
Looking at what modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
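The exact source of case0/case1 is not shown in the question; a plausible switch-based rewrite (an assumption, chosen only to match the labels in the assembly below) would be:
uint32_t
case0(uint32_t r) {
    /* Same mapping as lookup0's table { 0, 1, 2, 3 }. */
    switch(r) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    }
    __builtin_unreachable();
}
uint32_t
case1(uint32_t r) {
    /* Same mapping as lookup1's table { 0, 0, 1, 1 }. */
    switch(r) {
    case 0: return 0;
    case 1: return 0;
    case 2: return 1;
    case 3: return 1;
    }
    __builtin_unreachable();
}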
lookup0(unsigned int): # @lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # @case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # @lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # @case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumptions are:
- All inputs have an index in the LUT.
- All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
- (Edit4) Would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and there would be some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 functions are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology, so ~256-1024 is where I would expect the upper bound in size, but for a generic LUT it could obviously get much larger.
Comments (2)
The best "generic" solution I am aware of is the following:
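A minimal sketch of the kind of loop meant here, selecting each entry with a compare-and-multiply (the name lookup1_loop and the exact form are assumptions, not the answer's original code):
#include <stdint.h>
#include <stddef.h>

uint32_t
lookup1_loop(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    uint32_t result = 0;
    /* Accumulate tbl[i] multiplied by the 0/1 result of the comparison,
       so only the matching entry contributes and there is no branch. */
    for(size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); ++i)
        result += (r == i) * tbl[i];
    return result;
}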
At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code in which each table value is multiplied by the result of a comparison. The GCC/Clang optimizers know that this multiplication can be replaced with conditional moves (developers often use it as a trick to guide compilers into generating assembly code without conditional branches).
The resulting assembly for Clang (and likewise for GCC) has no branches and no memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction. A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big the loop will not be automatically unrolled. You can tell the compiler to unroll more aggressively via compilation flags. That being said, a LUT will likely be faster when the table is big, since loading and executing a huge amount of code is slow in this pathological case.
You could pack the array into a long integer and use bitshifts and anding to extract the result.
For example, the table {2,0,3,1} could be handled with two bits per entry, as in the sketch below. It produces relatively nice assembly: not perfect, but branchless and with no indirection.
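A minimal sketch of that packing, assuming two bits per entry (the name lookup_packed is illustrative, not the answer's original snippet):
#include <stdint.h>

uint32_t
lookup_packed(uint32_t r) {
    /* Table { 2, 0, 3, 1 } packed at two bits per entry:
       entry i occupies bits [2*i, 2*i + 1], so packed == 0x72. */
    const uint32_t packed = (2u << 0) | (0u << 2) | (3u << 4) | (1u << 6);
    /* Shift the wanted entry down and mask off the rest. */
    return (packed >> (r * 2)) & 3u;
}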
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks to allow handling larger arrays, like using longer integers (i.e. uint64_t or the __uint128_t extension). Another approach is to split the bits of the values in the array, e.g. into a high and a low byte, look each part up separately, and combine the results using bitwise operations.
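A sketch of the splitting idea (not the answer's code; to keep the constants short it splits each value into 4-bit halves rather than bytes, and the 8-entry table of byte-sized values is made up):
#include <stdint.h>

uint32_t
lookup_split(uint32_t r) {
    /* Hypothetical table { 0x12, 0x34, 0x05, 0xA7, 0x60, 0x1F, 0x08, 0x9C },
       split into two packed words of 4-bit pieces, entry i at bits [4*i, 4*i + 3]. */
    const uint32_t lo_parts = 0xC8F07542u; /* low 4 bits of each entry  */
    const uint32_t hi_parts = 0x9016A031u; /* high 4 bits of each entry */
    uint32_t lo = (lo_parts >> (r * 4)) & 0xFu;
    uint32_t hi = (hi_parts >> (r * 4)) & 0xFu;
    /* Recombine the two partial lookups into the original 8-bit value. */
    return (hi << 4) | lo;
}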