当前位置：文江博客话题详情

固定长度 6 int 数组的最快排序

发布于 2024-09-01 09:03:44 字数 6764 浏览 11 评论 0原文

回答另一个 Stack Overflow 问题（这个one）我偶然发现了一个有趣的子问题。对 6 个整数的数组进行排序的最快方法是什么？

由于问题的级别非常低：

我们不能假设库可用（并且调用本身有其成本），只能使用普通 C
来避免清空指令管道（其成本非常）也许应该最大限度地减少分支、跳转和所有其他类型的控制流中断（例如隐藏在 && 或 || 中的序列点后面的控制流）。
空间有限，最小化寄存器和内存使用是一个问题，理想情况下，就地排序可能是最好的。

实际上，这个问题是一种高尔夫，其目标不是最小化源长度而是最小化执行时间。我将其称为“Zening”代码，如书名中所用Zen of Code optimization 作者：Michael Abrash 及其续集。

至于为什么有趣，有几个层次：

这个例子很简单，易于理解和测量，没有涉及太多的C技能，
它显示了选择一个好的算法对问题的影响，但也显示了编译器和底层硬件的影响。

这是我的参考（简单的，未优化的）实现和我的测试集。

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

原始结果

随着变体数量越来越大，我将它们全部收集在一个测试套件中，可以在找到在这里。感谢 Kevin Stock，实际使用的测试比上面显示的测试要简单一些。您可以在自己的环境中编译并执行它。我对不同目标架构/编译器上的行为非常感兴趣。（好吧，伙计们，把它写在答案中，我将为新结果集的每个贡献者+1）。

一年前，我向 Daniel Stutzbach（高尔夫）给出了答案，因为他是当时最快解决方案（排序网络）的来源。

Linux 64 位、gcc 4.6.1 64 位、Intel Core 2 Duo E8400、-O2

直接调用 qsort 库函数：689.38
Naive 实现（插入排序）：285.70
插入排序 (Daniel Stutzbach)：142.12
展开的插入排序：125.47
排名顺序：102.26
带寄存器的排名顺序：58.03
排序网络 (Daniel Stutzbach)：111.68
排序网络 (Paul R)：66.36
带快速交换的排序网络 12：58.86
排序网络 12 重新排序的交换：53.74
排序网络 1 2 重新排序简单交换：31.54
带快速交换的重新排序网络：31.54
带快速交换的重新排序网络 V2：33.63
内联冒泡排序 (Paolo Bonzini)：48.85
展开插入排序 (Paolo Bonzini)：75.30

Linux 64 位，gcc 4.6 .1 64 位，Intel Core 2 Duo E8400，-O1

直接调用 qsort 库函数：705.93
Naive 实现（插入排序）：135.60
插入排序 (Daniel Stutzbach)：142.11
展开插入排序：126.75
排名顺序：46.42
带寄存器的排名顺序：43.58
排序网络（Daniel Stutzbach）：115.57
排序网络（Paul R）：64.44
带快速交换的排序网络 12：61.98
排序网络 12 重新排序的交换：54.67
排序网络 12 重新排序的简单交换：31.54
重新排序的排序网络（带快速交换）：31.54快速交换：31.24
带快速交换的重新排序网络 V2：33.07
内联冒泡排序 (Paolo Bonzini)：45.79
展开插入排序 (Paolo Bonzini)：80.15

我同时包含了 -O1 和 -O2 结果，因为令人惊讶的是，对于几个程序来说，O2 效率低于O1。请问具体是通过什么优化达到这个效果的？

对建议解决方案的评论

插入排序（Daniel Stutzbach）

正如预期的那样，最小化分支确实是一个好主意。

排序网络 (Daniel Stutzbach)

比插入排序更好。我想知道主要效果是否不是通过避免外部循环而获得的。我尝试通过展开插入排序进行检查，实际上我们得到了大致相同的数字（代码位于此处）。

排序网络 (Paul R)

迄今为止最好的。我用来测试的实际代码位于此处。还不知道为什么它的速度几乎是其他排序网络实现的两倍。参数传递？快最大 ?

使用快速交换对 12 个交换进行排序

根据 Daniel Stutzbach 的建议，我将他的 12 个交换排序网络与无分支快速交换结合起来（代码为此处）。它确实更快，是迄今为止最好的，并且利润率很小（大约 5％），正如使用 1 次交换所预期的那样。

值得注意的是，无分支交换似乎比 PPC 架构上使用 if 的简单交换效率低得多（4 倍）。

调用库 qsort

为了提供另一个参考点，我也尝试按照建议调用库 qsort （代码是 here ）。正如预期的那样，它要慢得多：慢了 10 到 30 倍...随着新测试套件的出现，主要问题似乎是第一次调用后库的初始加载，并且与其他库相比并没有那么差版本。在我的 Linux 上，速度只慢了 3 到 20 倍。在其他人用于测试的某些架构上，它似乎甚至更快（我对此感到非常惊讶，因为库 qsort 使用更复杂的 API）。

排名顺序

Rex Kerr 提出了另一种完全不同的方法：对于数组的每个项目，直接计算其最终位置。这是有效的，因为计算排序不需要分支。这种方法的缺点是它需要三倍于数组的内存量（一份数组和变量的副本来存储排序顺序）。性能结果非常令人惊讶（并且有趣）。在我的具有 32 位操作系统和 Intel Core2 Quad E8300 的参考架构上，周期计数略低于 1000（就像使用分支交换对网络进行排序一样）。但当在我的 64 位机器（Intel Core2 Duo）上编译和执行时，它的性能要好得多：它成为迄今为止最快的。我终于找到了真正的原因。我的 32 位盒子使用 gcc 4.4.1，我的 64 位盒子使用 gcc 4.4.3，最后一个似乎在优化这个特定代码方面要好得多（其他建议几乎没有什么区别）。

更新：

如上面公布的数据所示，gcc 的更高版本仍然增强了这种效果，并且排名顺序始终是任何其他替代方案的两倍。

使用重新排序的交换对 Networks 12 进行排序

Rex Kerr 提案与 gcc 4.4.3 的惊人效率让我想知道：一个内存使用量是无分支排序网络 3 倍的程序如何能够比无分支排序网络更快？我的假设是，它具有较少的先写后读的依赖性，从而可以更好地使用 x86 的超标量指令调度程序。这给了我一个想法：重新排序交换以最大限度地减少写入后读取的依赖性。更简单地说：当你执行 SWAP(1, 2); 时SWAP(0, 2); 在执行第二次交换之前，您必须等待第一次交换完成，因为两者都访问公共内存单元。当你执行 SWAP(1, 2); 时SWAP(4, 5)；处理器可以并行执行两者。我尝试了一下，它按预期工作，排序网络的运行速度提高了约 10%。

使用简单交换对 Networks 12 进行排序

在最初的帖子 Steinar H. Gunderson 提出一年后，我们不应该试图超越编译器并保持交换代码简单。这确实是一个好主意，因为生成的代码大约快了 40%！他还提出了使用 x86 内联汇编代码手动优化的交换，仍然可以节省更多周期。最令人惊讶的（它涉及程序员心理学）是一年前没有人尝试过该版本的交换。我用来测试的代码位于此处。其他人建议使用其他方法来编写 C 快速交换，但它产生的性能与使用像样的编译器的简单交换相同。

现在的“最佳”代码如下：

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

如果我们相信我们的测试集（是的，它相当差，它的唯一好处是简短、简单且易于理解我们正在测量的内容），则测试集的平均周期数一种排序的结果代码低于 40 个周期（执行了 6 个测试）。这使得每次交换平均为 4 个周期。我称其速度快得惊人。还有其他可能的改进吗？

原文

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 integers?

As the question is very low level:

we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.

Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.

As for why it is interesting, there is several layers:

the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

Raw results

As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).

I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?

Sorting Networks 12 SWAP with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.

It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.

Calling Library qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

Rank order

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).

update:

As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

The "best" code is now as follow:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

千と千尋 2024-09-08 09:03:45

虽然我真的很喜欢提供的交换宏：

#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

我看到了一个改进（一个好的编译器可能会做出这样的改进）：

#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }

我们注意到 min 和 max 是如何工作的，并显式地提取公共子表达式。这完全消除了最小和最大宏。

While I really like the swap macro provided:

#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

I see an improvement (which a good compiler might make):

#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }

We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.

回复收藏 0 原文

多孤肩上扛 2024-09-08 09:03:45

在没有进行基准测试和查看实际编译器生成的程序集的情况下，切勿优化最小/最大。如果我让 GCC 使用条件移动指令优化 min，我会得到 33% 的加速：（

#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }

测试代码中的 280 个周期与 420 个周期相比）。用 ?: 做 max 或多或少是相同的，几乎迷失在噪音中，但上面的速度更快一点。 GCC 和 Clang 的这种交换速度更快。

编译器在寄存器分配和别名分析方面也做得非常出色，有效地将 d[x] 预先移动到局部变量中，并且仅在最后复制回内存。事实上，它们比完全使用局部变量更好（例如 d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5])。我写这篇文章是因为您假设进行了强大的优化，但试图在最小/最大方面胜过编译器。 :)

顺便说一句，我尝试了 Clang 和 GCC。他们做了相同的优化，但由于调度差异，两者的结果有一些差异，不能真正说哪个更快或更慢。 GCC 在排序网络上更快，Clang 在二次排序上更快。

为了完整起见，展开的冒泡排序和插入排序也是可能的。这是冒泡排序：

SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);

这是插入排序：

//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);

这种插入排序比 Daniel Stutzbach 的速度更快，并且在 GPU 或带有预测的计算机上特别好，因为 ITER 仅需 3 条指令即可完成（而 SWAP 则需要 4 条指令）。例如，这里是 t = d[2];国际热核聚变实验堆(1)； ITER(0); ARM 汇编中的行：

    MOV    r6, r2
    CMP    r6, r1
    MOVLT  r2, r1
    MOVLT  r1, r6
    CMP    r6, r0
    MOVLT  r1, r0
    MOVLT  r0, r6

对于 6 个元素，插入排序与排序网络竞争（12 次交换与 15 次迭代平衡 4 条指令/交换与 3 条指令/迭代）；气泡排序当然比较慢。但当大小增加时，情况就不再是这样了，因为插入排序是 O(n^2)，而排序网络是 O(n log n)。

Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:

#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }

(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.

Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)

By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.

Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:

SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);

and here is the insertion sort:

//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);

This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:

    MOV    r6, r2
    CMP    r6, r1
    MOVLT  r2, r1
    MOVLT  r1, r6
    CMP    r6, r0
    MOVLT  r1, r0
    MOVLT  r0, r6

For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

回复收藏 0 原文

小草泠泠 2024-09-08 09:03:45

我将测试套件移植到我无法识别的 PPC 架构机器上（不必接触代码，只需增加测试的迭代次数，使用 8 个测试用例以避免 mods 污染结果并替换 x86 特定的 rdtsc）：

< strong>直接调用 qsort 库函数 : 101

简单实现（插入排序） : 299

插入排序 (Daniel Stutzbach) : 108

展开插入排序 : 51

排序网络 (Daniel Stutzbach) : 26

排序网络 (Paul R) : 85

使用快速交换对网络 12 进行排序 : 117

排序网络 12 重新排序交换：116

排名顺序：56

回复收藏 0 原文

拥抱我好吗 2024-09-08 09:03:45

XOR 交换在您的交换函数中可能很有用。

void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

if 可能会导致代码中出现太多分歧，但如果您保证所有整数都是唯一的，这可能会很方便。

An XOR swap may be useful in your swapping functions.

void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

回复收藏 0 原文

将军与妓 2024-09-08 09:03:45

期待亲自尝试并从这些示例中学习，但首先是我的 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM 的一些时序。（我从 http://www.mcs 借用了一个类似 rdtsc 的计时器用于 PPC。 anl.gov/~kazutomo/rdtsc.html 的计时。）我运行了该程序几次，绝对结果各不相同，但始终最快的测试是“插入排序（Daniel Stutzbach）”，其中“插入排序”展开”紧随其后。

这是最后一组时间：

**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)**     : 85
**Insertion Sort Unrolled**               : 97
**Sorting Networks (Daniel Stutzbach)**   : 457
**Sorting Networks (Paul R)**             : 179
**Sorting Networks 12 with Fast Swap**    : 238
**Sorting Networks 12 reordered Swap**    : 236
**Rank Order**                            : 116

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.

Here's the last set of times:

**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)**     : 85
**Insertion Sort Unrolled**               : 97
**Sorting Networks (Daniel Stutzbach)**   : 457
**Sorting Networks (Paul R)**             : 179
**Sorting Networks 12 with Fast Swap**    : 238
**Sorting Networks 12 reordered Swap**    : 236
**Rank Order**                            : 116

回复收藏 0 原文

伪心 2024-09-08 09:03:45

这是我对此线程的贡献：针对包含唯一值的 6 成员 int 向量 (valp) 的优化 1, 4 间隙 shellsort。

void shellsort (int *valp)
{      
  int c,a,*cp,*ip=valp,*ep=valp+5;

  c=*valp;    a=*(valp+4);if (c>a) {*valp=    a;*(valp+4)=c;}
  c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}

  cp=ip;    
  do
  {
    c=*cp;
    a=*(cp+1);
    do
    {
      if (c<a) break;

      *cp=a;
      *(cp+1)=c;
      cp-=1;
      c=*cp;
    } while (cp>=valp);
    ip+=1;
    cp=ip;
  } while (ip<ep);
}

在我的配备双核 Athlon M300 @ 2 Ghz（DDR2 内存）的 HP dv7-3010so 笔记本电脑上，它以 165 个时钟周期执行。这是根据每个独特序列的计时计算出的平均值（总共 6！/720）。使用 OpenWatcom 1.8 编译为 Win32。该循环本质上是一种插入排序，长度为 16 条指令/37 字节。

我没有 64 位环境可供编译。

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.

void shellsort (int *valp)
{      
  int c,a,*cp,*ip=valp,*ep=valp+5;

  c=*valp;    a=*(valp+4);if (c>a) {*valp=    a;*(valp+4)=c;}
  c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}

  cp=ip;    
  do
  {
    c=*cp;
    a=*(cp+1);
    do
    {
      if (c<a) break;

      *cp=a;
      *(cp+1)=c;
      cp-=1;
      c=*cp;
    } while (cp>=valp);
    ip+=1;
    cp=ip;
  } while (ip<ep);
}

On my HP dv7-3010so laptop with a dual-core Athlon M300 @ 2 Ghz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6!/720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.

I do not have a 64-bit environment to compile on.

回复收藏 0 原文

楠木可依 2024-09-08 09:03:45

如果插入排序在这里具有相当的竞争力，我建议尝试 shellsort。恐怕 6 个元素可能太少了，无法跻身最佳之列，但可能值得一试。

示例代码，未经测试、未经调试等。您想要调整 inc = 4 和 inc -= 3 序列以找到最佳值（例如尝试 inc = 2、inc -= 1）。

static __inline__ int sort6(int * d) {
    char j, i;
    int tmp;
    for (inc = 4; inc > 0; inc -= 3) {
        for (i = inc; i < 5; i++) {
            tmp = a[i];
            j = i;
            while (j >= inc && a[j - inc] > tmp) {
                a[j] = a[j - inc];
                j -= inc;
            }
            a[j] = tmp;
        }
    }
}

我不认为这会获胜，但如果有人发布有关对 10 个元素进行排序的问题，谁知道呢......

根据维基百科，这甚至可以与排序网络结合使用：
普拉特，V (1979)。希尔排序和排序网络（计算机科学领域的杰出论文）。花环。 ISBN 0-824-04406-1

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.

Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).

static __inline__ int sort6(int * d) {
    char j, i;
    int tmp;
    for (inc = 4; inc > 0; inc -= 3) {
        for (i = inc; i < 5; i++) {
            tmp = a[i];
            j = i;
            while (j >= inc && a[j - inc] > tmp) {
                a[j] = a[j - inc];
                j -= inc;
            }
            a[j] = tmp;
        }
    }
}

I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...

According to Wikipedia this can even be combined with sorting networks:
Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

回复收藏 0 原文

诺曦 2024-09-08 09:03:45

我知道我已经迟到了，但我有兴趣尝试一些不同的解决方案。首先，我清理了该粘贴，对其进行编译，并将其放入存储库中。我将一些不需要的解决方案保留为死胡同，以便其他人不会尝试。其中是我的第一个解决方案，它试图确保 x1>x2 被计算一次。优化后，它并不比其他简单版本快。

我添加了排名顺序排序的循环版本，因为我自己的研究应用是对 2-8 个项目进行排序，所以由于参数数量可变，因此需要一个循环。这也是我忽略排序网络解决方案的原因。

测试代码没有测试重复项是否被正确处理，因此虽然现有的解决方案都是正确的，但我在测试代码中添加了一个特殊情况，以确保重复项得到正确处理。

然后，我编写了一个完全在 AVX 寄存器中的插入排序。在我的机器上，它比其他插入排序快 25%，但比排名顺序慢 100%。我这样做纯粹是为了实验，并没有期望由于插入排序中的分支而变得更好。

static inline void sort6_insertion_sort_avx(int* d) {
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
    __m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
            INT_MAX, INT_MAX, INT_MAX, INT_MAX);
    __m256i val, gt, permute;
    unsigned j;
     // 8 / 32 = 2^-2
#define ITER(I) \
        val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
        gt =  _mm256_cmpgt_epi32(sorted, val);\
        permute =  _mm256_blendv_epi8(index, shlpermute, gt);\
        j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
        sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
                val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
    ITER(1);
    ITER(2);
    ITER(3);
    ITER(4);
    ITER(5);
    int x[8];
    _mm256_storeu_si256((__m256i*)x, sorted);
    d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}

然后，我使用 AVX 编写了一个排序排序。这与其他排序解决方案的速度相匹配，但并不更快。这里的问题是我只能用 AVX 计算索引，然后我必须制作一个索引表。这是因为计算是基于目的地而不是基于源的。请参阅从基于源的索引转换为基于目标的索引

static inline void sort6_rank_order_avx(int* d) {
    __m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
    __m256i one = _mm256_set1_epi32(1);
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
    __m256i rot = src;
    __m256i index = _mm256_setzero_si256();
    __m256i gt, permute;
    __m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
    __m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
    __m256i srcIx = dstIx;
    __m256i eq = one;
    __m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
    rot = _mm256_permutevar8x32_epi32(rot, ror);\
    gt = _mm256_cmpgt_epi32(src, rot);\
    index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
    index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
                _mm256_cmpeq_epi32(src, rot)));\
    eq = _mm256_insert_epi32(eq, 0, I)
    INC(0);
    INC(1);
    INC(2);
    INC(3);
    INC(4);
    int e[6];
    e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
    int i[8];
    _mm256_storeu_si256((__m256i*)i, index);
    d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}

可以在此处找到该存储库： https://github.com/eyepatchParrot/sort6/

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead-ends so that others wouldn't try it. Among this was my first solution, which attempted to ensure that x1>x2 was calculated once. After optimization, it is no faster than the other, simple versions.

I added a looping version of rank order sort, since my own application of this study is for sorting 2-8 items, so since there are a variable number of arguments, a loop is necessary. This is also why I ignored the sorting network solutions.

The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.

Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.

static inline void sort6_insertion_sort_avx(int* d) {
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
    __m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
            INT_MAX, INT_MAX, INT_MAX, INT_MAX);
    __m256i val, gt, permute;
    unsigned j;
     // 8 / 32 = 2^-2
#define ITER(I) \
        val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
        gt =  _mm256_cmpgt_epi32(sorted, val);\
        permute =  _mm256_blendv_epi8(index, shlpermute, gt);\
        j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
        sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
                val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
    ITER(1);
    ITER(2);
    ITER(3);
    ITER(4);
    ITER(5);
    int x[8];
    _mm256_storeu_si256((__m256i*)x, sorted);
    d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}

Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices

static inline void sort6_rank_order_avx(int* d) {
    __m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
    __m256i one = _mm256_set1_epi32(1);
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
    __m256i rot = src;
    __m256i index = _mm256_setzero_si256();
    __m256i gt, permute;
    __m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
    __m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
    __m256i srcIx = dstIx;
    __m256i eq = one;
    __m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
    rot = _mm256_permutevar8x32_epi32(rot, ror);\
    gt = _mm256_cmpgt_epi32(src, rot);\
    index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
    index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
                _mm256_cmpeq_epi32(src, rot)));\
    eq = _mm256_insert_epi32(eq, 0, I)
    INC(0);
    INC(1);
    INC(2);
    INC(3);
    INC(4);
    int e[6];
    e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
    int i[8];
    _mm256_storeu_si256((__m256i*)i, index);
    d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}

The repo can be found here: https://github.com/eyepatchParrot/sort6/

回复收藏 0 原文

汹涌人海 2024-09-08 09:03:45

这个问题已经变得很老了，但这些天我实际上必须解决同样的问题：对小数组进行排序的快速算法。我认为分享我的知识是个好主意。虽然我最初开始使用排序网络，但我最终设法找到了其他算法，这些算法对 6 个值的每个排列进行排序所执行的比较总数小于排序网络，也小于插入排序。我没有计算交换的次数；我希望它大致相当（有时可能更高一点）。

算法 sort6 使用算法 sort4，而算法 sort4 使用算法 sort3。下面是一些轻量级 C++ 形式的实现（原始版本是重模板的，因此它可以与任何随机访问迭代器和任何合适的比较函数一起使用）。

对 3 个值进行排序

以下算法是展开插入排序。当必须执行两次交换（6 次赋值）时，它会使用 4 次赋值：

void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}

看起来有点复杂，因为排序对于数组的每一种可能的排列或多或少都有一个分支，使用 2~3 次比较和最多 4 次赋值对三个值进行排序。

对 4 个值进行排序

此函数调用 sort3，然后对数组的最后一个元素执行展开插入排序：

void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort 
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}

该算法执行 3 到 6 次比较，最多 5 次交换。展开插入排序很容易，但我们将使用另一种算法进行最后一个排序...

对 6 个值进行排序

此算法使用我所谓的双插入排序的展开版本。这个名字不太好，但它具有很强的描述性，它的工作原理如下：对

数组中除第一个和最后一个元素之外的所有元素进行排序。
如果第一个元素大于最后一个元素，则交换数组的第一个元素和元素。
从前面将第一个元素插入到排序序列中，然后从后面将最后一个元素插入到排序序列中。

交换之后，第一个元素总是小于最后一个元素，这意味着，当将它们插入到排序序列中时，在最坏的情况下不会有超过 N 次的比较来插入两个元素：例如，如果第一个元素已插入到第 3 个位置，那么最后一个元素不能插入到低于第 4 个位置。

void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}

我对 6 个值的每个排列的测试表明，该算法始终执行 6 到 13 次比较。我没有计算执行的交换次数，但我预计在最坏的情况下它不会高于 11 次。

我希望这会有所帮助，即使这个问题可能不再代表实际问题:)

编辑：将其放入提供的基准测试中后，它显然比大多数有趣的替代方案慢。它的性能往往比展开的插入排序好一点，但仅此而已。基本上，它不是整数的最佳排序，但对于具有昂贵比较操作的类型可能很有趣。

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).

Sorting 3 values

The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:

void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}

It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.

Sorting 4 values

This one calls sort3 then performs an unrolled insertion sort with the last element of the array:

void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort 
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}

This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...

Sorting 6 values

This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:

Sort everything but the first and the last elements of the array.
Swap the first and the elements of the array if the first is greater than the last.
Insert the first element into the sorted sequence from the front then the last element from the back.

After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been insert in the 3rd position, then the last one can't be inserted lower than the 4th position.

void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}

My tests on every permutation of 6 values ever show that this algorithms always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.

I hope that this helps, even if this question may not represent an actual problem anymore :)

EDIT: after putting it in the provided benchmark, it is cleary slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

回复收藏 0 原文

柒七 2024-09-08 09:03:45

我发现至少在我的系统上，下面定义的函数 sort6_iterator() 和 sort6_iterator_local() 的运行速度至少与上面当前的运行速度一样快，而且经常明显更快记录保持者：

#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)

template<class IterType> 
inline void sort6_iterator(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
  const auto b = MAX(*(it + x), *(it + y)); \
  *(it + x) = a; *(it + y) = b; }
  
  SWAP(1, 2) SWAP(4, 5)
  SWAP(0, 2) SWAP(3, 5)
  SWAP(0, 1) SWAP(3, 4)
  SWAP(1, 4) SWAP(0, 3)
  SWAP(2, 5) SWAP(1, 3)
  SWAP(2, 4)
  SWAP(2, 3)
#undef SWAP
}

我在计时代码中将此函数传递给了 std::vector 的迭代器。

编译器可以更积极地优化模板函数可能是速度的原因之一。（我还怀疑使用迭代器可以为 g++ 提供关于迭代器引用的内存的某些保证（否则它不会有），这有助于优化。）IIRC 这也是原因的一部分为什么这么多 STL 算法（例如 std::sort()）通常具有如此出色的性能。

编译器可以做到这一点，因为模板函数不是函数。 C++ 继承了 C 的许多规则，这些规则限制编译器如何处理代码，例如：

重新排列/更改/删除代码的限制。
编译器可以做出的假设的限制。

C++ 委员会决定模板应遵循更适合优化的不同规则。

例如：在 void sort2(int* p, int* q){/*...*/} 中，编译器不能假设 p 和 q 指向非重叠内存，即 p != q （因为 swap2(d, d) 可能会在某处调用），甚至 < code>p 和 q 不为空。如果它是内联的，那么它可能会在像 swap2(d,d+1) 这样的调用中发现 p != q。编译器还对引用有一定的保证（例如：与指针不同，它们永远不会为空），因此它们比指针更适合优化。

此外，IIRC，编译器必须将像 sort2 这样的函数存储在内存中，以防另一个源文件尝试通过函数指针调用 swapMem。这对于像 template这样的模板函数来说不是必需的。 void sort2(T* p, T* q){/*...*/}，不需要通过函数指针访问。哪些函数可以内联，但不适用于函数模板，也存在一些限制（我见过的 STL 算法的所有实现都有大量的方法/函数（模板）调用，但是它们仍然运行得非常快，因为它们中的大多数实际上是被编译器内联掉的）。

值得一提的是，哪种代码最快可能在很大程度上取决于优化器。由于不同的代码可能会进行不同的优化，这可以解释为什么 sort6 的变化（例如：使用不同的排序网络，或定义 MAX/MIN/ SWAP 不同）可能有非常不同的运行时间。

例如，sort6_iterator()有时（同样，取决于调用该函数的上下文）始终优于以下排序函数，该函数将数据复制到¹ 由于只定义了 6 个局部变量，如果这些局部变量是原语，那么它们可能从未实际存储在 RAM 中，而只存储在 CPU 的寄存器中，直到函数调用结束，这有助于加快排序功能。（编译器知道不同的局部变量在内存中具有不同的位置也有帮助）。

template<class IterType> 
inline void sort6_iterator_local(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  const auto b = MAX(data##x, data##y); \
  data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a)   auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
  
  DD2(1,2)    SWAP(1, 2)
  DD2(4,5)    SWAP(4, 5)
  DD1(0)      SWAP(0, 2)
  DD1(3)      SWAP(3, 5)
  SWAP(0, 1)  SWAP(3, 4)
  SWAP(1, 4)  SWAP(0, 3)   CB(0)
  SWAP(2, 5)  CB(5)
  SWAP(1, 3)  CB(1)
  SWAP(2, 4)  CB(4)
  SWAP(2, 3)  CB(2)        CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}

请注意，按如下方式定义 SWAP() 有时会导致性能稍好一些，但大多数时候它会导致性能稍差或性能差异可以忽略不计。

#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  data##y = MAX(data##x, data##y); \
  data##x = a; }

如果您只想要一个针对原始数据类型的排序算法，那么无论对排序函数的调用出现在¹中的什么上下文中，gcc -O3 始终擅长优化，具体取决于您传递输入的方式，尝试以下两种算法之一：

template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a)   register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a)   *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

或者，如果您想通过引用传递变量，则使用此算法（下面的函数与上面的前 5 行不同）：

template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a)   register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a)   e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

使用 register 的原因关键字是因为这是您知道需要在寄存器中使用这些 int 值的少数情况之一（如果可能的话）。如果没有register，编译器在大多数情况下会计算出这一点，但有时却不会。使用register关键字有助于解决这个问题。但是，通常情况下，不要使用 register 关键字，因为它更有可能减慢代码而不是加速代码（例如：对 sort6 使用所有 6 个寄存器意味着它们可以'不能用于其他用途，这可能会导致代码整体变慢）。

在对各种排序函数进行计时时，我注意到调用排序函数的上下文（即周围的代码）对性能有重大影响，这可能是由于函数被内联然后优化所致。例如，如果程序足够简单，那么向排序函数传递一个指针与向其传递一个迭代器之间通常在性能上没有太大差异；否则，使用迭代器通常会带来明显更好的性能，并且永远不会（至少根据我到目前为止的经验）出现任何明显更差的性能。我怀疑这可能是因为g++可以全局优化足够简单的代码。

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:

#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)

template<class IterType> 
inline void sort6_iterator(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
  const auto b = MAX(*(it + x), *(it + y)); \
  *(it + x) = a; *(it + y) = b; }
  
  SWAP(1, 2) SWAP(4, 5)
  SWAP(0, 2) SWAP(3, 5)
  SWAP(0, 1) SWAP(3, 4)
  SWAP(1, 4) SWAP(0, 3)
  SWAP(2, 5) SWAP(1, 3)
  SWAP(2, 4)
  SWAP(2, 3)
#undef SWAP
}

I passed this function a std::vector's iterator in my timing code.

That compilers can more aggressively optimize template functions might be one reason for the speed. (I also suspect that using iterators gives g++ certain assurances (that it otherwise wouldn't have) about what the memory that the iterator refers to, which facilitates optimization.) IIRC this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.

Compilers can do this because template functions are NOT functions. C++ inherited many rules from C that limit what how compilers can do with code, like:

limits on rearranging/changing/removing code.
limits on what assumptions the compiler can make.

The C++ committee decided that templates should follow different rules that are more amenable to optimization.

For example: in void sort2(int* p, int* q){/*...*/}, the compiler can't assume that p and q point to non-overlapping memory, that p != q (since swap2(d, d) might be called somewhere), or even that p and q are not null. If it's inlined then it might figure out that p != q in a call like swap2(d,d+1). Compilers also have certain guarantees about references (ex: they're never null, unlike pointers), so they're more amenable to optimization than pointers.

Also, IIRC, compilers MUST store functions like sort2 in memory in case another source file tries calling swapMem via a function pointer. This isn't required of template functions like template<class T> void sort2(T* p, T* q){/*...*/}, which are not required to be accessible via function pointers. There are also limits to what functions can be inlined that don't apply to function templates (all implementations of the STL algorithms that I've seen have a plethora of method/function (template) calls but they still run super fast since most of them really are inlined away by the compiler).

It's important to mention that what code is fastest probably depends a lot on the optimizer. Since different code might be optimized differently, this may explain why variations of sort6 (ex: using different sorting networks, or defining MAX/MIN/SWAP differently) may have very different run-times.

For example, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.¹ Since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).

template<class IterType> 
inline void sort6_iterator_local(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  const auto b = MAX(data##x, data##y); \
  data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a)   auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
  
  DD2(1,2)    SWAP(1, 2)
  DD2(4,5)    SWAP(4, 5)
  DD1(0)      SWAP(0, 2)
  DD1(3)      SWAP(3, 5)
  SWAP(0, 1)  SWAP(3, 4)
  SWAP(1, 4)  SWAP(0, 3)   CB(0)
  SWAP(2, 5)  CB(5)
  SWAP(1, 3)  CB(1)
  SWAP(2, 4)  CB(4)
  SWAP(2, 3)  CB(2)        CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}

Note that defining SWAP() as follows sometimes results in slightly better performance although most of the time it results in slightly worse performance or a negligible difference in performance.

#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  data##y = MAX(data##x, data##y); \
  data##x = a; }

If you just want a sorting algorithm that on primitive data types, gcc -O3 is consistently good at optimizing no matter what context the call to the sorting function appears in¹ then, depending on how you pass the input, try one of the following two algorithms:

template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a)   register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a)   *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

Or if you want to pass the variables by reference then use this (the below function differs from the above in its first 5 lines):

template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a)   register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a)   e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

The reason for using the register keyword is because this is one of the few times that you know that you want these int values in registers (if that's possible). Without register, the compiler will figure this out most of the time but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword since it's more likely to slow your code than speed it up (ex: using all 6 registers for sort6 means they can't be used for something else, which might result in overall slower code).

While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

回复收藏 0 原文

帝王念 2024-09-08 09:03:45

我相信你的问题有两个部分。

首先是确定最优算法。这是通过循环遍历每个可能的排序（没有那么多）来完成的（至少在这种情况下），这允许您计算比较和交换的精确最小值、最大值、平均值和标准偏差。还有一两个亚军。
二是优化算法。将教科书代码示例转换为平均且精益的现实算法可以做很多事情。如果您意识到算法无法优化到所需的程度，请尝试亚军。

我不会太担心清空管道（假设当前是 x86）：分支预测已经取得了长足的进步。我担心的是确保代码和数据各自适合一个缓存行（代码可能是两个）。一旦出现，获取延迟就会非常低，这将弥补任何停顿。这也意味着你的内部循环可能有 10 条指令左右，这正是它应该在的位置（我的排序算法中有两个不同的内部循环，它们分别是 10 条指令/22 字节和 9/22 长）。假设代码不包含任何 div，您可以确定它会快得令人眼花缭乱。

回复收藏 0 原文

违心° 2024-09-08 09:03:45

我知道这是一个老问题。

但我只是写了一个不同类型的解决方案，我想分享。
仅使用嵌套的 MIN MAX，

速度并不快，因为每个使用 114 个，
可以像这样简单地将其减少到 75 -> pastebin

但它不再纯粹是最小最大了。

可能有效的方法是使用 AVX 一次对多个整数执行最小/最大

PMINSW 参考

#include <stdio.h>

static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
  unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
  "=A" (x));
  return x;
}

#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))

static __inline__ void sort6(int * in) {
  const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];

  in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );

  const int
  AB = MAX(A, B),
  AC = MAX(A, C),
  AD = MAX(A, D),
  AE = MAX(A, E),
  AF = MAX(A, F),
  BC = MAX(B, C),
  BD = MAX(B, D),
  BE = MAX(B, E),
  BF = MAX(B, F),
  CD = MAX(C, D),
  CE = MAX(C, E),
  CF = MAX(C, F),
  DE = MAX(D, E),
  DF = MAX(D, F),
  EF = MAX(E, F);

  in[1] = MIN4 (
  MIN4( AB, AC, AD, AE ),
  MIN4( AF, BC, BD, BE ),
  MIN4( BF, CD, CE, CF ),
  MIN3( DE, DF, EF)
  );

  const int
  ABC = MAX(AB,C),
  ABD = MAX(AB,D),
  ABE = MAX(AB,E),
  ABF = MAX(AB,F),
  ACD = MAX(AC,D),
  ACE = MAX(AC,E),
  ACF = MAX(AC,F),
  ADE = MAX(AD,E),
  ADF = MAX(AD,F),
  AEF = MAX(AE,F),
  BCD = MAX(BC,D),
  BCE = MAX(BC,E),
  BCF = MAX(BC,F),
  BDE = MAX(BD,E),
  BDF = MAX(BD,F),
  BEF = MAX(BE,F),
  CDE = MAX(CD,E),
  CDF = MAX(CD,F),
  CEF = MAX(CE,F),
  DEF = MAX(DE,F);

  in[2] = MIN( MIN4 (
  MIN4( ABC, ABD, ABE, ABF ),
  MIN4( ACD, ACE, ACF, ADE ),
  MIN4( ADF, AEF, BCD, BCE ),
  MIN4( BCF, BDE, BDF, BEF )),
  MIN4( CDE, CDF, CEF, DEF )
  );


  const int
  ABCD = MAX(ABC,D),
  ABCE = MAX(ABC,E),
  ABCF = MAX(ABC,F),
  ABDE = MAX(ABD,E),
  ABDF = MAX(ABD,F),
  ABEF = MAX(ABE,F),
  ACDE = MAX(ACD,E),
  ACDF = MAX(ACD,F),
  ACEF = MAX(ACE,F),
  ADEF = MAX(ADE,F),
  BCDE = MAX(BCD,E),
  BCDF = MAX(BCD,F),
  BCEF = MAX(BCE,F),
  BDEF = MAX(BDE,F),
  CDEF = MAX(CDE,F);

  in[3] = MIN4 (
  MIN4( ABCD, ABCE, ABCF, ABDE ),
  MIN4( ABDF, ABEF, ACDE, ACDF ),
  MIN4( ACEF, ADEF, BCDE, BCDF ),
  MIN3( BCEF, BDEF, CDEF )
  );

  const int
  ABCDE= MAX(ABCD,E),
  ABCDF= MAX(ABCD,F),
  ABCEF= MAX(ABCE,F),
  ABDEF= MAX(ABDE,F),
  ACDEF= MAX(ACDE,F),
  BCDEF= MAX(BCDE,F);

  in[4]= MIN (
  MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
  MIN ( ACDEF, BCDEF )
  );

  in[5] = MAX(ABCDE,F);
}

int main(int argc, char ** argv) {
  int d[6][6] = {
    {1, 2, 3, 4, 5, 6},
    {6, 5, 4, 3, 2, 1},
    {100, 2, 300, 4, 500, 6},
    {100, 2, 3, 4, 500, 6},
    {1, 200, 3, 4, 5, 600},
    {1, 1, 2, 1, 2, 1}
  };

  unsigned long long cycles = rdtsc();
  for (int i = 0; i < 6; i++) {
    sort6(d[i]);
  }
  cycles = rdtsc() - cycles;
  printf("Time is %d\n", (unsigned)cycles);

  for (int i = 0; i < 6; i++) {
    printf("d%d : %d %d %d %d %d %d\n", i,
     d[i][0], d[i][1], d[i][2],
     d[i][3], d[i][4], d[i][5]);
  }
}

编辑:
排名顺序解决方案的灵感来自 Rex Kerr，
比上面的混乱快得多

static void sort6(int *o) {
const int 
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
          BC = B>C, BD = B>D, BE = B>E,
                    CD = C>D, CE = C>E,
                              DE = D>E,
a =          AB + AC + AD + AE + (A>F),
b = 1 - AB      + BC + BD + BE + (B>F),
c = 2 - AC - BC      + CD + CE + (C>F),
d = 3 - AD - BD - CD      + DE + (D>F),
e = 4 - AE - BE - CE - DE      + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}

I know this is an old question.

But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,

It's not fast as it uses 114 of each,
could reduce it to 75 pretty simply like so -> pastebin

But then it's not purely min max anymore.

What might work is doing min/max on multiple integers at once with AVX

PMINSW reference

#include <stdio.h>

static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
  unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
  "=A" (x));
  return x;
}

#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))

static __inline__ void sort6(int * in) {
  const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];

  in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );

  const int
  AB = MAX(A, B),
  AC = MAX(A, C),
  AD = MAX(A, D),
  AE = MAX(A, E),
  AF = MAX(A, F),
  BC = MAX(B, C),
  BD = MAX(B, D),
  BE = MAX(B, E),
  BF = MAX(B, F),
  CD = MAX(C, D),
  CE = MAX(C, E),
  CF = MAX(C, F),
  DE = MAX(D, E),
  DF = MAX(D, F),
  EF = MAX(E, F);

  in[1] = MIN4 (
  MIN4( AB, AC, AD, AE ),
  MIN4( AF, BC, BD, BE ),
  MIN4( BF, CD, CE, CF ),
  MIN3( DE, DF, EF)
  );

  const int
  ABC = MAX(AB,C),
  ABD = MAX(AB,D),
  ABE = MAX(AB,E),
  ABF = MAX(AB,F),
  ACD = MAX(AC,D),
  ACE = MAX(AC,E),
  ACF = MAX(AC,F),
  ADE = MAX(AD,E),
  ADF = MAX(AD,F),
  AEF = MAX(AE,F),
  BCD = MAX(BC,D),
  BCE = MAX(BC,E),
  BCF = MAX(BC,F),
  BDE = MAX(BD,E),
  BDF = MAX(BD,F),
  BEF = MAX(BE,F),
  CDE = MAX(CD,E),
  CDF = MAX(CD,F),
  CEF = MAX(CE,F),
  DEF = MAX(DE,F);

  in[2] = MIN( MIN4 (
  MIN4( ABC, ABD, ABE, ABF ),
  MIN4( ACD, ACE, ACF, ADE ),
  MIN4( ADF, AEF, BCD, BCE ),
  MIN4( BCF, BDE, BDF, BEF )),
  MIN4( CDE, CDF, CEF, DEF )
  );


  const int
  ABCD = MAX(ABC,D),
  ABCE = MAX(ABC,E),
  ABCF = MAX(ABC,F),
  ABDE = MAX(ABD,E),
  ABDF = MAX(ABD,F),
  ABEF = MAX(ABE,F),
  ACDE = MAX(ACD,E),
  ACDF = MAX(ACD,F),
  ACEF = MAX(ACE,F),
  ADEF = MAX(ADE,F),
  BCDE = MAX(BCD,E),
  BCDF = MAX(BCD,F),
  BCEF = MAX(BCE,F),
  BDEF = MAX(BDE,F),
  CDEF = MAX(CDE,F);

  in[3] = MIN4 (
  MIN4( ABCD, ABCE, ABCF, ABDE ),
  MIN4( ABDF, ABEF, ACDE, ACDF ),
  MIN4( ACEF, ADEF, BCDE, BCDF ),
  MIN3( BCEF, BDEF, CDEF )
  );

  const int
  ABCDE= MAX(ABCD,E),
  ABCDF= MAX(ABCD,F),
  ABCEF= MAX(ABCE,F),
  ABDEF= MAX(ABDE,F),
  ACDEF= MAX(ACDE,F),
  BCDEF= MAX(BCDE,F);

  in[4]= MIN (
  MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
  MIN ( ACDEF, BCDEF )
  );

  in[5] = MAX(ABCDE,F);
}

int main(int argc, char ** argv) {
  int d[6][6] = {
    {1, 2, 3, 4, 5, 6},
    {6, 5, 4, 3, 2, 1},
    {100, 2, 300, 4, 500, 6},
    {100, 2, 3, 4, 500, 6},
    {1, 200, 3, 4, 5, 600},
    {1, 1, 2, 1, 2, 1}
  };

  unsigned long long cycles = rdtsc();
  for (int i = 0; i < 6; i++) {
    sort6(d[i]);
  }
  cycles = rdtsc() - cycles;
  printf("Time is %d\n", (unsigned)cycles);

  for (int i = 0; i < 6; i++) {
    printf("d%d : %d %d %d %d %d %d\n", i,
     d[i][0], d[i][1], d[i][2],
     d[i][3], d[i][4], d[i][5]);
  }
}

EDIT:
Rank order solution inspired by Rex Kerr's,
Much faster than the mess above

static void sort6(int *o) {
const int 
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
          BC = B>C, BD = B>D, BE = B>E,
                    CD = C>D, CE = C>E,
                              DE = D>E,
a =          AB + AC + AD + AE + (A>F),
b = 1 - AB      + BC + BD + BE + (B>F),
c = 2 - AC - BC      + CD + CE + (C>F),
d = 3 - AD - BD - CD      + DE + (D>F),
e = 4 - AE - BE - CE - DE      + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}

回复收藏 0 原文

柳絮泡泡 2024-09-08 09:03:45

我想我应该尝试展开 Ford-Johnson 合并插入排序，它可以实现最小可能的比较次数 (ceil(log2(6!)) = 10) 并且没有交换。
不过，它并不竞争（我的时机比最差的排序网络解决方案 sort6_sorting_network_v1 稍好）。

它将值加载到 6 个寄存器中，然后执行 8 到 10 次比较
来决定720=6中的哪一个！
情况，然后将寄存器写回到适当的寄存器中
这 720 个订单中（每个案例都有单独的代码）。
在最终回写之前，不会进行任何交换或重新排序。我还没有查看生成的汇编代码。

static inline void sort6_ford_johnson_unrolled(int *D) {
  register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
  #define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
  #define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
                                     : c<b ? abcdef(a,c,b,d,e,f) \
                                           : abcdef(a,b,c,d,e,f))
  #define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
                                           : abcdef(a,c,b,e,d,f) \
                                     : c<e ? abcdef(a,b,c,e,d,f) \
                                           : abcdef(a,b,e,c,d,f))
  #define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
                                             : abedf_cd(a,e,c,d,b,f) \
                                       : e<d ? abedf_cd(a,b,c,d,e,f) \
                                             : abdef_cd(a,b,c,d,e,f))
  #define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
                                      : b<f ? abdf_cd_ef(a,b,e,f,c,d) \
                                            : abdf_cd_ef(e,f,a,b,c,d))
  #define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
                                     : abd_cd_ef(c,d,a,b,e,f))
  #define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
                                  : ab_cd_ef(a,b,c,d,f,e))
  #define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
                               : ab_cd(a,b,d,c,e,f))
  a<b ? ab(a,b,c,d,e,f)
      : ab(b,a,c,d,e,f);
  #undef ab
  #undef ab_cd
  #undef ab_cd_ef
  #undef abd_cd_ef
  #undef abdf_cd_ef
  #undef abedf_cd
  #undef abdef_cd
  #undef abcdef
}

TEST(ford_johnson_unrolled,   "Unrolled Ford-Johnson Merge-Insertion sort");

I thought I'd try an unrolled Ford-Johnson merge-insertion sort, which achieves the minimum possible number of comparisons (ceil(log2(6!)) = 10) and no swaps.
It doesn't compete, though (I got a slightly better timing than the worst sorting networks solution sort6_sorting_network_v1).

It loads the values into six registers, then performs 8 to 10 comparisons
to decide which of the 720=6!
cases it's in, then writes the registers back in the appropriate one
of those 720 orders (separate code for each case).
There are no swaps or reordering of anything until the final write-back. I haven't looked at the generated assembly code.

static inline void sort6_ford_johnson_unrolled(int *D) {
  register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
  #define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
  #define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
                                     : c<b ? abcdef(a,c,b,d,e,f) \
                                           : abcdef(a,b,c,d,e,f))
  #define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
                                           : abcdef(a,c,b,e,d,f) \
                                     : c<e ? abcdef(a,b,c,e,d,f) \
                                           : abcdef(a,b,e,c,d,f))
  #define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
                                             : abedf_cd(a,e,c,d,b,f) \
                                       : e<d ? abedf_cd(a,b,c,d,e,f) \
                                             : abdef_cd(a,b,c,d,e,f))
  #define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
                                      : b<f ? abdf_cd_ef(a,b,e,f,c,d) \
                                            : abdf_cd_ef(e,f,a,b,c,d))
  #define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
                                     : abd_cd_ef(c,d,a,b,e,f))
  #define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
                                  : ab_cd_ef(a,b,c,d,f,e))
  #define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
                               : ab_cd(a,b,d,c,e,f))
  a<b ? ab(a,b,c,d,e,f)
      : ab(b,a,c,d,e,f);
  #undef ab
  #undef ab_cd
  #undef ab_cd_ef
  #undef abd_cd_ef
  #undef abdf_cd_ef
  #undef abedf_cd
  #undef abdef_cd
  #undef abcdef
}

TEST(ford_johnson_unrolled,   "Unrolled Ford-Johnson Merge-Insertion sort");

回复收藏 0 原文

咿呀咿呀哟 2024-09-08 09:03:45

我知道聚会已经过去 12 年了，但仍然如此。

的并行迭代对少数项目进行排序的最佳方法，

shifted = [0 sorted](1 : length(sorted));
shifted = max(shifted, new_value);
sorted = min(sorted, shifted);

我找到了使用从 sorted = Ones(1, N) * max_value; 开始

，这也能很好地实现 SIMDify

using vec4 = int32_t __attribute__((vector_size(16));

int32_t max = 0x7fffffff;

vec4 sorted_lo = { max, max, max, max }, sorted_hi = sorted_lo;
sorted_lo[0] = data[0];  // "pre-sort" the first element directly

for (int i = 1 ; i < 6; i++) {
 vec4 new_value{ data[i], data[i], data[i], data[i] };
 vec4 shifted_lo = { 0, sorted_lo[0], sorted_lo[1], sorted_lo[2]};
 vec4 shifted_hi = { sorted_lo[3],sorted_hi[0],sorted_hi[1],sorted_hi[2]};

 shifted_lo = max(shifted_lo, new_value0);
 shifted_hi = max(shifted_hi, new_value0);

 sorted_lo = max(sorted_lo, shifted_lo);
 sorted_hi = max(sorted_hi, shifted_hi);
}

即使对于许多 SIMD 架构（SSE4、ARM64）包含最小/最大操作，以及两个向量之间的移位以及单个通道的复制。在 AVX2 中，无法在通道之间有效地进行转换，但可以对两个包含 5-8 个元素的独立向量进行排序。

一个大大改进的版本将使用两个向量的双调合并排序。第一个向量对前 3 个值使用插入排序技巧，例如来自 1 8 4 3 5 4（如 1 4 8 inf），另一个向量针对 inf 进行排序5 4 3。

然后，双调合并阶段将排序 A = min(a,b); B = max(a,b)，然后对A[0]<->A[2], A[1]<->A[3], B[0 ]<->B[2], B[1]<->B[3] 并行，最后是最后阶段 A[0]<->A[1 ]，A[2]→A[3]，B[0]→B[1]，B[2]→B[3]。

对于示例值，这意味着

  A = 1 4 8 inf, B = inf 5 4 3
      <--------------->
        <---------------->
          <---------------->
             <--------------->
  A = 1 4 4 3  , B = inf 5 8 inf
      <--->          <----->
        <--->            <---->
  A = 1 3 4 4  , B = 8   5 inf inf
      <-> <->        <---> <---->
  A = 1 3 4 5  , B = 5   8 inf inf

可能最困难的步骤是 A[0,2]、A[1,3]、B[0,2]、B[1,3] 阶段，需要一些洗牌。相比之下，至少 ARM64 NEON 有一条指令来提取成对的最小值/最大值。

为了回答这个评论，将数据传输到向量寄存器通常不是问题。从 SIMD 寄存器或内存检索数据时可能会有较长的延迟。对于SSE2，我会尝试将6个元素放入一个向量中，

int32_t x[8] = {inf, 0,0,0,0,0,0, inf};

该向量将允许快速洗牌（pshufd / _mm_shuffle_epi32）以获得val[1], inf的第一个排序向量, inf, inf 与 inf, inf, inf, val[4] 以及通过 XXM 寄存器广播任何通道。

I know the party's over 12 years ago, but still.

I've found the best way to sort a handful of items using the parallel iteration of

shifted = [0 sorted](1 : length(sorted));
shifted = max(shifted, new_value);
sorted = min(sorted, shifted);

Starting with sorted = ones(1, N) * max_value;

This is going to SIMDify quite nicely even with

using vec4 = int32_t __attribute__((vector_size(16));

int32_t max = 0x7fffffff;

vec4 sorted_lo = { max, max, max, max }, sorted_hi = sorted_lo;
sorted_lo[0] = data[0];  // "pre-sort" the first element directly

for (int i = 1 ; i < 6; i++) {
 vec4 new_value{ data[i], data[i], data[i], data[i] };
 vec4 shifted_lo = { 0, sorted_lo[0], sorted_lo[1], sorted_lo[2]};
 vec4 shifted_hi = { sorted_lo[3],sorted_hi[0],sorted_hi[1],sorted_hi[2]};

 shifted_lo = max(shifted_lo, new_value0);
 shifted_hi = max(shifted_hi, new_value0);

 sorted_lo = max(sorted_lo, shifted_lo);
 sorted_hi = max(sorted_hi, shifted_hi);
}

Many SIMD architectures (SSE4, ARM64) contain the min/max operations, as well as the shifting between two vectors, and duplication of a single lane. In AVX2 one can not efficiently shift between the lanes, but one can sort two independent vectors of 5-8 elements.

A greatly improved version of this will use bitonic merge sort with two vectors. The first vector uses the insertion sort trick for 3 first values e.g. from 1 8 4 3 5 4 as in 1 4 8 inf, the other vector sorted for inf 5 4 3.

A bitonic merge stage will then sort A = min(a,b); B = max(a,b), then sort the A[0]<->A[2], A[1]<->A[3], B[0]<->B[2], B[1]<->B[3] in parallel and finally the last stage A[0]<->A[1], A[2]<->A[3],B[0]<->B[1], B[2]<->B[3].

With the example values, this would mean

  A = 1 4 8 inf, B = inf 5 4 3
      <--------------->
        <---------------->
          <---------------->
             <--------------->
  A = 1 4 4 3  , B = inf 5 8 inf
      <--->          <----->
        <--->            <---->
  A = 1 3 4 4  , B = 8   5 inf inf
      <-> <->        <---> <---->
  A = 1 3 4 5  , B = 5   8 inf inf

Probably the most difficult step there is A[0,2], A[1,3], B[0,2], B[1,3] stage, requiring some shuffling. In contrast at least ARM64 NEON has an instruction to extract the pairwise minimum/maximum.

To answer the comment, it's generally not an issue to transfer data to vector registers. There might be longer latency when retrieving the data either from SIMD registers or from memory. For SSE2, I would e.g. try to place the 6 elements in a vector of

int32_t x[8] = {inf, 0,0,0,0,0,0, inf};

which would allow fast shuffling to (pshufd / _mm_shuffle_epi32) to get the first sorted vectors of val[1], inf, inf, inf vs inf, inf, inf, val[4] and also broadcasting any lane over an XXM registers.

回复收藏 0 原文

瀟灑尐姊 2024-09-08 09:03:45

也许我来晚了，但至少我的贡献是一种新方法。

代码确实应该内联，
即使是内联，有太多分支，
分析部分基本上是 O(N(N-1))，对于 N=6 来说似乎没问题，
代码可能会更有效，如果交换的成本会更高（相对于比较的成本），
我相信内联的静态函数。
该方法与排序相关
- 使用相对排名（偏移量）来代替排名。
- 任何排列组中每个周期的排名总和为零。
- 不是 SWAP() 两个元素，而是追逐循环，只需要一个临时值和一个（寄存器 -> 寄存器）交换（新 <- 旧）。

更新：稍微改变了代码，有些人使用C++编译器来编译C代码...

#include <stdio.h>

#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif

static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);

int wsort6(int *arr);

void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", *arr);
        }
fprintf(stderr,"\n");
}

void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", (int) *arr);
        }
fprintf(stderr,"\n");
}

static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;

for (top = 0; top < cnt; top++ ) {
        for (bot = 0; bot < top; bot++ ) {
                if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
                }
        }
return ;
}
        /* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){

dif[0] =     (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}

static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;

Dif difs[cnt];

#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);

#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif

nswap = 0;
for (idx=0; idx < cnt; idx++) {
        int newval;
        int step,cyc;
        if ( !difs[idx] ) continue;
        newval = arr[idx];
        cyc = 0;
        src = idx;
        do      {
                int oldval;
                step = difs[src];
                difs[src] =0;
                dst = src + step;
                cyc += step ;
                if(dst == idx+1)idx=dst;
                oldval = arr[dst];
#if (DUMP_IT&1)
                fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d  Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
                        , nswap, cyc, step, idx, oldval, newval
                        , src, dst, difs[dst], arr[dst]
                        , newval  );
                do_print_a("Array ", arr, cnt);
                do_print_d("Steps ", difs, cnt);
#endif

                arr[dst] = newval;
                newval = oldval;
                nswap++;
                src = dst;
                } while( cyc);
        }

return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

Maybe I am late to the party, but at least my contribution is a new approach.

The code really should be inlined
even if inlined, there are too many branches
the analysing part is basically O(N(N-1)) which seems OK for N=6
the code could be more effective if the cost of swap would be higher (irt the cost of compare)
I trust on static functions being inlined.
The method is related to rank-sort
- instead of ranks, the relative ranks (offsets) are used.
- the sum of the ranks is zero for every cycle in any permutation group.
- instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).

Update: changed the code a bit, some people use C++ compilers to compile C code ...

#include <stdio.h>

#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif

static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);

int wsort6(int *arr);

void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", *arr);
        }
fprintf(stderr,"\n");
}

void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", (int) *arr);
        }
fprintf(stderr,"\n");
}

static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;

for (top = 0; top < cnt; top++ ) {
        for (bot = 0; bot < top; bot++ ) {
                if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
                }
        }
return ;
}
        /* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){

dif[0] =     (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}

static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;

Dif difs[cnt];

#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);

#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif

nswap = 0;
for (idx=0; idx < cnt; idx++) {
        int newval;
        int step,cyc;
        if ( !difs[idx] ) continue;
        newval = arr[idx];
        cyc = 0;
        src = idx;
        do      {
                int oldval;
                step = difs[src];
                difs[src] =0;
                dst = src + step;
                cyc += step ;
                if(dst == idx+1)idx=dst;
                oldval = arr[dst];
#if (DUMP_IT&1)
                fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d  Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
                        , nswap, cyc, step, idx, oldval, newval
                        , src, dst, difs[dst], arr[dst]
                        , newval  );
                do_print_a("Array ", arr, cnt);
                do_print_d("Steps ", difs, cnt);
#endif

                arr[dst] = newval;
                newval = oldval;
                nswap++;
                src = dst;
                } while( cyc);
        }

return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

回复收藏 0 原文

秋千易 2024-09-08 09:03:45

//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
    int t[6] = {0};
    int r1,r2;

    r1=0;
    r1 += (a[0] > a[1]);
    r1 += (a[0] > a[2]);
    r1 += (a[0] > a[3]);
    r1 += (a[0] > a[4]);
    r1 += (a[0] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[0];

    r2=0;
    r2 += (a[1] > a[0]);
    r2 += (a[1] > a[2]);
    r2 += (a[1] > a[3]);
    r2 += (a[1] > a[4]);
    r2 += (a[1] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[1];

    r1=0;
    r1 += (a[2] > a[0]);
    r1 += (a[2] > a[1]);
    r1 += (a[2] > a[3]);
    r1 += (a[2] > a[4]);
    r1 += (a[2] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[2];

    r2=0;
    r2 += (a[3] > a[0]);
    r2 += (a[3] > a[1]);
    r2 += (a[3] > a[2]);
    r2 += (a[3] > a[4]);
    r2 += (a[3] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[3];

    r1=0;
    r1 += (a[4] > a[0]);
    r1 += (a[4] > a[1]);
    r1 += (a[4] > a[2]);
    r1 += (a[4] > a[3]);
    r1 += (a[4] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[4];

    r2=0;
    r2 += (a[5] > a[0]);
    r2 += (a[5] > a[1]);
    r2 += (a[5] > a[2]);
    r2 += (a[5] > a[3]);
    r2 += (a[5] > a[4]);
    while(t[r2]){r2++;} 
    t[r2] = a[5];

    a[0]=t[0];
    a[1]=t[1];
    a[2]=t[2];
    a[3]=t[3];
    a[4]=t[4];
    a[5]=t[5];
}

static __inline__ void sort6(int* a)
{
    #define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
    register int t;

    wire( 0, 1); wire( 2, 3); wire( 4, 5);
    wire( 3, 5); wire( 0, 2); wire( 1, 4);
    wire( 4, 5); wire( 2, 3); wire( 0, 1); 
    wire( 3, 4); wire( 1, 2); 
    wire( 2, 3);

    #undef wire
}

//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
    int t[6] = {0};
    int r1,r2;

    r1=0;
    r1 += (a[0] > a[1]);
    r1 += (a[0] > a[2]);
    r1 += (a[0] > a[3]);
    r1 += (a[0] > a[4]);
    r1 += (a[0] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[0];

    r2=0;
    r2 += (a[1] > a[0]);
    r2 += (a[1] > a[2]);
    r2 += (a[1] > a[3]);
    r2 += (a[1] > a[4]);
    r2 += (a[1] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[1];

    r1=0;
    r1 += (a[2] > a[0]);
    r1 += (a[2] > a[1]);
    r1 += (a[2] > a[3]);
    r1 += (a[2] > a[4]);
    r1 += (a[2] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[2];

    r2=0;
    r2 += (a[3] > a[0]);
    r2 += (a[3] > a[1]);
    r2 += (a[3] > a[2]);
    r2 += (a[3] > a[4]);
    r2 += (a[3] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[3];

    r1=0;
    r1 += (a[4] > a[0]);
    r1 += (a[4] > a[1]);
    r1 += (a[4] > a[2]);
    r1 += (a[4] > a[3]);
    r1 += (a[4] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[4];

    r2=0;
    r2 += (a[5] > a[0]);
    r2 += (a[5] > a[1]);
    r2 += (a[5] > a[2]);
    r2 += (a[5] > a[3]);
    r2 += (a[5] > a[4]);
    while(t[r2]){r2++;} 
    t[r2] = a[5];

    a[0]=t[0];
    a[1]=t[1];
    a[2]=t[2];
    a[3]=t[3];
    a[4]=t[4];
    a[5]=t[5];
}

static __inline__ void sort6(int* a)
{
    #define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
    register int t;

    wire( 0, 1); wire( 2, 3); wire( 4, 5);
    wire( 3, 5); wire( 0, 2); wire( 1, 4);
    wire( 4, 5); wire( 2, 3); wire( 0, 1); 
    wire( 3, 4); wire( 1, 2); 
    wire( 2, 3);

    #undef wire
}

回复收藏 0 原文

绝情姑娘 2024-09-08 09:03:45

尝试“合并排序列表”排序。 :) 使用两个数组。对于小型和大型阵列来说速度最快。
如果你连接，你只检查插入位置。其他更大的值不需要比较（cmp = ab>0）。
对于 4 个号码，您可以使用系统 4-5 cmp (~4.6) 或 3-6 cmp (~4.9)。冒泡排序使用 6 cmp (6)。对于大数字较慢的代码有很多 cmp。
此代码使用 5 cmp（不是 MSL 排序）：

if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}`

主要 MSL

9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8

js 代码

function sortListMerge_2a(cmp)  
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end   = arr_count;
//var str = '';
cycles = 0;
if (end>3)
    {
    stepmax = ((end - start + 1) >> 1) << 1;
    m = 1;
    n = 2;
    for (step=1;step<stepmax;step<<=1)  //bounds 1-1, 2-2, 4-4, 8-8...
        {
        a = start;
        while (a<end)
            {
            b = a + step;
            c = a + step + step;
            b = b<end ? b : end;
            c = c<end ? c : end;
            i = a;
            j = b;
            k = i;
            while (i<b && j<c)
                {
                if (cmp(arr[m][i],arr[m][j])>0)
                    {arr[n][k] = arr[m][j]; j++; k++;}
                else    {arr[n][k] = arr[m][i]; i++; k++;}
                }
            while (i<b)
                {arr[n][k] = arr[m][i]; i++; k++;
}
            while (j<c)
                {arr[n][k] = arr[m][j]; j++; k++;
}
            a = c;
            }
        tmp = m; m = n; n = tmp;
        }
    return m;
    }
else
    {
    // sort 3 items
    sort10(cmp);
    return m;
    }
}

Try 'merging sorted list' sort. :) Use two array. Fastest for small and big array.
If you concating, you only check where insert. Other bigger values you not need compare (cmp = a-b>0).
For 4 numbers, you can use system 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort use 6 cmp (6). Lots of cmp for big numbers slower code.
This code use 5 cmp (not MSL sort):

if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}`

Principial MSL

9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8

js code

function sortListMerge_2a(cmp)  
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end   = arr_count;
//var str = '';
cycles = 0;
if (end>3)
    {
    stepmax = ((end - start + 1) >> 1) << 1;
    m = 1;
    n = 2;
    for (step=1;step<stepmax;step<<=1)  //bounds 1-1, 2-2, 4-4, 8-8...
        {
        a = start;
        while (a<end)
            {
            b = a + step;
            c = a + step + step;
            b = b<end ? b : end;
            c = c<end ? c : end;
            i = a;
            j = b;
            k = i;
            while (i<b && j<c)
                {
                if (cmp(arr[m][i],arr[m][j])>0)
                    {arr[n][k] = arr[m][j]; j++; k++;}
                else    {arr[n][k] = arr[m][i]; i++; k++;}
                }
            while (i<b)
                {arr[n][k] = arr[m][i]; i++; k++;
}
            while (j<c)
                {arr[n][k] = arr[m][j]; j++; k++;
}
            a = c;
            }
        tmp = m; m = n; n = tmp;
        }
    return m;
    }
else
    {
    // sort 3 items
    sort10(cmp);
    return m;
    }
}

回复收藏 0 原文

甜味拾荒者 2024-09-08 09:03:45

好吧，如果它只有 6 个元素，并且您可以利用并行性，想要最小化条件分支等。为什么您不生成所有组合并测试顺序呢？我敢说，在某些架构中，它可以非常快（只要你预先分配了内存）

回复收藏 0 原文

许久 2024-09-08 09:03:45

使用 cmp==0 对 4 个项目进行排序。
cmp 的数量约为 4.34（FF 原生约为 4.52），但比合并列表花费 3 倍的时间。但如果你有大数字或大文本，最好少做 cmp 操作。
编辑：修复bug

在线测试http://mlich.zam。 slu.cz/js-sort/x-sort-x2.htm

function sort4DG(cmp,start,end,n) // sort 4
{
var n     = typeof(n)    !=='undefined' ? n   : 1;
var cmp   = typeof(cmp)  !=='undefined' ? cmp   : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end   = typeof(end)  !=='undefined' ? end   : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
    {
    swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
    cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
    if (cc[13]>0)
        {
        swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
        return n;
        }
    else    {
    cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3]));  // new cc23 | c03 //repaired
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    }
else    {
    if (cc[12]>0)
        {
        swap(n,i+1,i+2);
        cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    else    {
        return n;
        }
    }
return n;
}

Sort 4 items with usage cmp==0.
Numbers of cmp is ~4.34 (FF native have ~4.52), but take 3x time than merging lists. But better less cmp operations, if you have big numbers or big text.
Edit: repaired bug

Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm

function sort4DG(cmp,start,end,n) // sort 4
{
var n     = typeof(n)    !=='undefined' ? n   : 1;
var cmp   = typeof(cmp)  !=='undefined' ? cmp   : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end   = typeof(end)  !=='undefined' ? end   : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
    {
    swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
    cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
    if (cc[13]>0)
        {
        swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
        return n;
        }
    else    {
    cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3]));  // new cc23 | c03 //repaired
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    }
else    {
    if (cc[12]>0)
        {
        swap(n,i+1,i+2);
        cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    else    {
        return n;
        }
    }
return n;
}

回复收藏 0 原文

携余温的黄昏 2024-09-08 09:03:44

对于任何优化，最好总是测试、测试、再测试。我至少会尝试排序网络和插入排序。如果我打赌，我会根据过去的经验把钱投入插入排序。

您知道有关输入数据的任何信息吗？某些算法对于某些类型的数据会表现得更好。例如，插入排序在排序或几乎排序的数据上表现更好，因此如果几乎排序数据的机会高于平均水平，那么插入排序将是更好的选择。

您发布的算法类似于插入排序，但看起来您已经以更多比较为代价最小化了交换次数。然而，比较比交换昂贵得多，因为分支可能导致指令管道停止。

这是插入排序的实现：

static __inline__ int sort6(int *d){
        int i, j;
        for (i = 1; i < 6; i++) {
                int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }
}

这是我构建排序网络的方式。首先，使用此站点为网络生成一组最小的 SWAP 宏适当的长度。将其包装在一个函数中可以得到：

static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

static __inline__ int sort6(int *d){
        int i, j;
        for (i = 1; i < 6; i++) {
                int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }
}

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}

回复收藏 0 原文

荒芜了季节 2024-09-08 09:03:44

这是使用排序网络的实现：

inline void Sort2(int *p0, int *p1)
{
    const int temp = min(*p0, *p1);
    *p1 = max(*p0, *p1);
    *p0 = temp;
}

inline void Sort3(int *p0, int *p1, int *p2)
{
    Sort2(p0, p1);
    Sort2(p1, p2);
    Sort2(p0, p1);
}

inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
    Sort2(p0, p1);
    Sort2(p2, p3);
    Sort2(p0, p2);  
    Sort2(p1, p3);  
    Sort2(p1, p2);  
}

inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
    Sort3(p0, p1, p2);
    Sort3(p3, p4, p5);
    Sort2(p0, p3);  
    Sort2(p2, p5);  
    Sort4(p1, p2, p3, p4);  
}

您确实需要非常高效的无分支min和 max 实现，因为这实际上就是这段代码归结为的 - 一系列 min 和 max 操作（每个 13 个，总计）。我将其作为练习留给读者。

请注意，此实现很容易实现矢量化（例如 SIMD - 大多数 SIMD ISA 具有矢量最小/最大指令），也适合 GPU 实现（例如 CUDA - 无分支，不存在扭曲发散等问题）。

另请参阅：快速算法实现对非常小的列表进行排序

Here's an implementation using sorting networks:

inline void Sort2(int *p0, int *p1)
{
    const int temp = min(*p0, *p1);
    *p1 = max(*p0, *p1);
    *p0 = temp;
}

inline void Sort3(int *p0, int *p1, int *p2)
{
    Sort2(p0, p1);
    Sort2(p1, p2);
    Sort2(p0, p1);
}

inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
    Sort2(p0, p1);
    Sort2(p2, p3);
    Sort2(p0, p2);  
    Sort2(p1, p3);  
    Sort2(p1, p2);  
}

inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
    Sort3(p0, p1, p2);
    Sort3(p3, p4, p5);
    Sort2(p0, p3);  
    Sort2(p2, p5);  
    Sort4(p1, p2, p3, p4);  
}

You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each, in total). I leave this as an exercise for the reader.

Note that this implementation lends itself easily to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).

回复收藏 0 原文

乄_柒ぐ汐 2024-09-08 09:03:44

由于这些是整数并且比较速度很快，为什么不直接计算每个的排名顺序：

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4);
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

Since these are integers and compares are fast, why not compute the rank order of each directly:

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4);
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

回复收藏 0 原文

信愁 2024-09-08 09:03:44

看起来我迟到了一年，但我们开始了...

查看 gcc 4.5.2 生成的程序集，我发现每次交换都会完成加载和存储，这实际上是不需要的。最好将 6 个值加载到寄存器中，对它们进行排序，然后将它们存储回内存中。我命令商店的负载尽可能靠近首先需要和最后使用寄存器的地方。我还使用了 Steinar H. Gunderson 的 SWAP 宏。更新：我切换到 Paolo Bonzini 的 SWAP 宏，gcc 将其转换为类似于 Gunderson 的宏，但 gcc 能够更好地对指令进行排序，因为它们没有作为显式汇编给出。

我使用与重新排序的交换网络相同的交换顺序作为最佳性能，尽管可能有更好的排序。如果我有更多时间，我将生成并测试一堆排列。

我更改了测试代码以考虑超过 4000 个数组，并显示对每个数组进行排序所需的平均周期数。在 i5-650 上，我得到约 34.1 周期/排序（使用 -O3），而原始重新排序的排序网络得到约 65.3 周期/排序（使用 -O1，击败 -O2 和 -O3）。

#include <stdio.h>

static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
    register int x0,x1,x2,x3,x4,x5;
    x1 = d[1];
    x2 = d[2];
    SWAP(x1, x2);
    x4 = d[4];
    x5 = d[5];
    SWAP(x4, x5);
    x0 = d[0];
    SWAP(x0, x2);
    x3 = d[3];
    SWAP(x3, x5);
    SWAP(x0, x1);
    SWAP(x3, x4);
    SWAP(x1, x4);
    SWAP(x0, x3);
    d[0] = x0;
    SWAP(x2, x5);
    d[5] = x5;
    SWAP(x1, x3);
    d[1] = x1;
    SWAP(x2, x4);
    d[4] = x4;
    SWAP(x2, x3);
    d[2] = x2;
    d[3] = x3;

#undef SWAP
#undef min
#undef max
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}

void ran_fill(int n, int *a) {
    static int seed = 76521;
    while (n--) *a++ = (seed = seed *1812433253 + 12345);
}

#define NTESTS 4096
int main() {
    int i;
    int d[6*NTESTS];
    ran_fill(6*NTESTS, d);

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6*NTESTS ; i+=6) {
        sort6_fast(d+i);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);

    for (i = 0; i < 6*NTESTS ; i+=6) {
        if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
            printf("d%d : %d %d %d %d %d %d\n", i,
                    d[i+0], d[i+1], d[i+2],
                    d[i+3], d[i+4], d[i+5]);
    }
    return 0;
}

我更改了修改了测试套件，以报告每种排序的时钟并运行更多测试（cmp 函数也已更新以处理整数溢出），以下是一些不同架构的结果。我尝试在 AMD cpu 上进行测试，但 rdtsc 在我可用的 X6 1100T 上不可靠。

Clarkdale (i5-650)
==================
Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
Rank Order                                 184.34   106.58   54.74    93.24    94.09
Rank Order with registers                  127.45   104.65   53.79    98.05    97.95
Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
Sorting Networks (Paul R)                  551.64   103.20   64.57    73.68    73.51
Sorting Networks 12 with Fast Swap         321.74   61.61    63.90    67.92    67.76
Sorting Networks 12 reordered Swap         318.75   60.69    65.90    70.25    70.06
Reordered Sorting Network w/ fast swap     145.91   34.17    32.66    32.22    32.18

Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
Rank Order                                 164.37   138.32   46.29    99.87    99.81
Rank Order with registers                  115.44   116.02   44.04    116.04   116.03
Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
Sorting Networks (Paul R)                  498.94   77.24    63.98    62.17    65.67
Sorting Networks 12 with Fast Swap         315.98   59.41    58.36    60.29    55.15
Sorting Networks 12 reordered Swap         307.67   55.78    51.48    51.67    50.74
Reordered Sorting Network w/ fast swap     149.68   31.46    30.91    31.54    31.58

Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
Rank Order                                 114.24   76.42    45.31    36.96    36.73
Rank Order with registers                  105.09   32.31    48.54    32.51    33.29
Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
Sorting Networks (Paul R)                  364.03   66.02    61.64    45.70    44.19
Sorting Networks 12 with Fast Swap         246.97   41.36    59.03    41.66    38.98
Sorting Networks 12 reordered Swap         235.39   38.84    47.36    38.61    37.29
Reordered Sorting Network w/ fast swap     115.58   27.23    27.75    27.25    26.54

Nehalem (Xeon E5640)
====================
Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
Rank Order                                 140.62   197.04   52.10    163.66   153.63
Rank Order with registers                  84.83    96.78    50.93    109.96   54.73
Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
Sorting Networks (Paul R)                  459.17   163.76   56.40    61.83    58.69
Sorting Networks 12 with Fast Swap         284.58   95.01    50.66    53.19    55.47
Sorting Networks 12 reordered Swap         281.20   96.72    44.15    56.38    54.57
Reordered Sorting Network w/ fast swap     128.34   50.87    26.87    27.91    28.02

Looks like I got to the party a year late, but here we go...

Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads at stores to be as close as possible to there the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.

I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.

I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).

#include <stdio.h>

static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
    register int x0,x1,x2,x3,x4,x5;
    x1 = d[1];
    x2 = d[2];
    SWAP(x1, x2);
    x4 = d[4];
    x5 = d[5];
    SWAP(x4, x5);
    x0 = d[0];
    SWAP(x0, x2);
    x3 = d[3];
    SWAP(x3, x5);
    SWAP(x0, x1);
    SWAP(x3, x4);
    SWAP(x1, x4);
    SWAP(x0, x3);
    d[0] = x0;
    SWAP(x2, x5);
    d[5] = x5;
    SWAP(x1, x3);
    d[1] = x1;
    SWAP(x2, x4);
    d[4] = x4;
    SWAP(x2, x3);
    d[2] = x2;
    d[3] = x3;

#undef SWAP
#undef min
#undef max
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}

void ran_fill(int n, int *a) {
    static int seed = 76521;
    while (n--) *a++ = (seed = seed *1812433253 + 12345);
}

#define NTESTS 4096
int main() {
    int i;
    int d[6*NTESTS];
    ran_fill(6*NTESTS, d);

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6*NTESTS ; i+=6) {
        sort6_fast(d+i);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);

    for (i = 0; i < 6*NTESTS ; i+=6) {
        if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
            printf("d%d : %d %d %d %d %d %d\n", i,
                    d[i+0], d[i+1], d[i+2],
                    d[i+3], d[i+4], d[i+5]);
    }
    return 0;
}

I changed modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well), here are the results on some different architectures. I attempted testing on an AMD cpu but rdtsc isn't reliable on the X6 1100T I have available.

Clarkdale (i5-650)
==================
Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
Rank Order                                 184.34   106.58   54.74    93.24    94.09
Rank Order with registers                  127.45   104.65   53.79    98.05    97.95
Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
Sorting Networks (Paul R)                  551.64   103.20   64.57    73.68    73.51
Sorting Networks 12 with Fast Swap         321.74   61.61    63.90    67.92    67.76
Sorting Networks 12 reordered Swap         318.75   60.69    65.90    70.25    70.06
Reordered Sorting Network w/ fast swap     145.91   34.17    32.66    32.22    32.18

Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
Rank Order                                 164.37   138.32   46.29    99.87    99.81
Rank Order with registers                  115.44   116.02   44.04    116.04   116.03
Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
Sorting Networks (Paul R)                  498.94   77.24    63.98    62.17    65.67
Sorting Networks 12 with Fast Swap         315.98   59.41    58.36    60.29    55.15
Sorting Networks 12 reordered Swap         307.67   55.78    51.48    51.67    50.74
Reordered Sorting Network w/ fast swap     149.68   31.46    30.91    31.54    31.58

Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
Rank Order                                 114.24   76.42    45.31    36.96    36.73
Rank Order with registers                  105.09   32.31    48.54    32.51    33.29
Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
Sorting Networks (Paul R)                  364.03   66.02    61.64    45.70    44.19
Sorting Networks 12 with Fast Swap         246.97   41.36    59.03    41.66    38.98
Sorting Networks 12 reordered Swap         235.39   38.84    47.36    38.61    37.29
Reordered Sorting Network w/ fast swap     115.58   27.23    27.75    27.25    26.54

Nehalem (Xeon E5640)
====================
Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
Rank Order                                 140.62   197.04   52.10    163.66   153.63
Rank Order with registers                  84.83    96.78    50.93    109.96   54.73
Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
Sorting Networks (Paul R)                  459.17   163.76   56.40    61.83    58.69
Sorting Networks 12 with Fast Swap         284.58   95.01    50.66    53.19    55.47
Sorting Networks 12 reordered Swap         281.20   96.72    44.15    56.38    54.57
Reordered Sorting Network w/ fast swap     128.34   50.87    26.87    27.91    28.02

回复收藏 0 原文

老子叫无熙 2024-09-08 09:03:44

测试代码非常糟糕；它溢出了初始数组（这里的人不读编译器警告吗？），printf 打印出错误的元素，它使用 .byte 作为 rdtsc 没有充分的理由，只有一次运行（！），没有任何检查最终结果实际上是正确的（因此很容易“优化”成一些微妙的错误），所包含的测试非常基本（没有负数？）并且没有什么可以阻止编译器将整个函数作为死代码丢弃。

话虽如此，改进双调网络解决方案也很容易；只需将 min/max/SWAP 更改为即可

#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }

，对我来说速度快了 65%（Debian gcc 4.4.5 with -O2、amd64、Core i7）。

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.

That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to

#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }

and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

回复收藏 0 原文

愿得七秒忆 2024-09-08 09:03:44

几天前我在 Google 上偶然发现了这个问题，因为我也需要快速对 6 个整数的固定长度数组进行排序。然而，就我而言，我的整数只有 8 位（而不是 32 位），并且我没有严格要求只使用 C。我想无论如何我都会分享我的发现，以防它们可能对某人有帮助......

我在汇编中实现了网络排序的变体，该变体使用 SSE 尽可能向量化比较和交换操作。需要六次“传递”才能对数组进行完全排序。我使用了一种新颖的机制来直接将 PCMPGTB（向量化比较）的结果转换为 PSHUFB（向量化交换）的无序参数，仅使用 PADDB（向量化加法），在某些情况下还使用 PAND（按位与）指令。

这种方法还具有产生真正无分支函数的副作用。没有任何跳转指令。

看来此实现比当前在问题中标记为最快选项的实现（“使用简单交换对网络 12 进行排序”）快约 38%。我修改了该实现以在测试期间使用 char 数组元素，以使比较公平。

我应该指出，这种方法可以应用于最多 16 个元素的任何数组大小。我预计对于更大的阵列，相对于替代方案的相对速度优势会变得更大。

该代码是用 MASM 编写的，适用于具有 SSSE3 的 x86_64 处理器。该函数使用“新”Windows x64 调用约定。就是这样...

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD 0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

您可以将其编译为可执行对象并将其链接到您的 C 项目中。有关如何在 Visual Studio 中执行此操作的说明，您可以阅读这篇文章。您可以使用以下 C 原型从 C 代码中调用该函数：

void simd_sort_6(char *values);

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...

I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.

This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.

It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.

I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.

The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD 0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code: