Generally everything I come across on the net relating to SSE/MMX turns out to be maths for vectors and matrices. However, I'm looking for libraries of SSE-optimized 'standard functions', like those provided by Agner Fog, or some of the SSE-based string scanning algorithms in GCC.
As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp, BSR/BSF, i.e. an stdlib-esque library built from SSE instructions.
I'd prefer them to target SSE1 (formerly MMX2) and use intrinsics rather than assembly, but either is fine. Hopefully this is not too broad a spectrum.
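To make the request concrete, here is a minimal sketch of the kind of routine meant: a copy loop written with SSE1 intrinsics only (`_mm_loadu_ps` / `_mm_storeu_ps` from `<xmmintrin.h>`). The name `sse1_copy` is made up for illustration. It assumes an x86 target and non-overlapping buffers, and it makes no attempt at alignment handling, prefetching or non-temporal stores, so treat it as a starting point rather than a tuned memcpy.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: copy n bytes, 16 at a time, through an XMM register.
   Unaligned loads/stores are used so any pointers are accepted.
   Buffers must not overlap. */
static void *sse1_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (n >= 16) {
        /* movups only moves bits, so arbitrary byte patterns survive intact */
        _mm_storeu_ps((float *)d, _mm_loadu_ps((const float *)s));
        d += 16;
        s += 16;
        n -= 16;
    }
    while (n--)          /* scalar tail for the last 0..15 bytes */
        *d++ = *s++;
    return dst;
}
```

Calling it looks exactly like memcpy: `sse1_copy(out, in, len);`.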
Update 1
I came across some promising stuff after some searching, one library caught my eye:
- LibFreeVec: seems to be Mac/IBM only (because it is AltiVec based), thus of little use (to me). I also can't find a direct download link, nor does it state the minimum supported SSE version.
I also came across an article on a few vectorised string functions (strlen, strstr, strcmp). However, SSE4.2 is way out of my reach (as said before, I'd like to stick to SSE1/MMX).
Update 2
Paul R motivated me to do a little benchmarking. Unfortunately, as my SSE assembly coding experience is close to zip, I used someone else's (http://www.mindcontrol.org/~hplus/) benchmarking code and added to it. All tests (excluding the original, which is VC6 SP5) were compiled under VC9 SP1 with full/customized optimizations and /arch:SSE on.
The first test was on my home machine (AMD Sempron 2200+, 512 MB DDR 333), capped at SSE1 (thus no vectorization by MSVC memcpy):
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
size     SSE Cycles  thru-sse      memcpy Cycles  thru-memcpy   asm Cycles  thru-asm
1 kB     2879        506.75 MB/s   4132           353.08 MB/s   2655        549.51 MB/s
2 kB     4877        598.29 MB/s   7041           414.41 MB/s   5179        563.41 MB/s
4 kB     8890        656.44 MB/s   13123          444.70 MB/s   9832        593.55 MB/s
8 kB     17413       670.28 MB/s   25128          464.48 MB/s   19403       601.53 MB/s
16 kB    34569       675.26 MB/s   48227          484.02 MB/s   38303       609.43 MB/s
32 kB    68992       676.69 MB/s   95582          488.44 MB/s   75969       614.54 MB/s
64 kB    138637      673.50 MB/s   195012         478.80 MB/s   151716      615.44 MB/s
128 kB   277678      672.52 MB/s   400484         466.30 MB/s   304670      612.94 MB/s
256 kB   565227      660.78 MB/s   906572         411.98 MB/s   618394      603.97 MB/s
512 kB   1142478     653.82 MB/s   1936657        385.70 MB/s   1380146     541.23 MB/s
1024 kB  2268244     658.64 MB/s   3989323        374.49 MB/s   2917758     512.02 MB/s
2048 kB  4556890     655.69 MB/s   8299992        359.99 MB/s   6166871     484.51 MB/s
4096 kB  9307132     642.07 MB/s   16873183       354.16 MB/s   12531689    476.86 MB/s
full tests
The second test batch was done on a university workstation (Intel E6550, 2.33 GHz, 2 GB DDR2 800?)
VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
size     SSE Cycles  thru-sse      memcpy Cycles  thru-memcpy   asm Cycles  thru-asm
1 kB     392         5797.69 MB/s  434            5236.63 MB/s  420         5411.18 MB/s
2 kB     882         5153.51 MB/s  707            6429.13 MB/s  714         6366.10 MB/s
4 kB     2044        4447.55 MB/s  1218           7463.70 MB/s  1218        7463.70 MB/s
8 kB     3941        4613.44 MB/s  2170           8378.60 MB/s  2303        7894.73 MB/s
16 kB    7791        4667.33 MB/s  4130           8804.63 MB/s  4410        8245.61 MB/s
32 kB    15470       4701.12 MB/s  7959           9137.61 MB/s  8708        8351.66 MB/s
64 kB    30716       4735.40 MB/s  15638          9301.22 MB/s  17458       8331.57 MB/s
128 kB   61019       4767.45 MB/s  31136          9343.05 MB/s  35259       8250.52 MB/s
256 kB   122164      4762.53 MB/s  62307          9337.80 MB/s  72688       8004.21 MB/s
512 kB   246302      4724.36 MB/s  129577         8980.15 MB/s  142709      8153.80 MB/s
1024 kB  502572      4630.66 MB/s  332941         6989.95 MB/s  290528      8010.38 MB/s
2048 kB  1105076     4211.91 MB/s  1384908        3360.86 MB/s  662172      7029.11 MB/s
4096 kB  2815589     3306.22 MB/s  4342289        2143.79 MB/s  2172961     4284.00 MB/s
full tests
As can be seen, SSE is very fast on my home system, but falls behind on the Intel machine (probably due to bad coding?). My x86 assembly variant comes in second on my home machine, and second on the Intel system (but the results look a bit inconsistent: on huge blocks it dominates the SSE1 version). The MSVC memcpy wins the Intel system tests hands down, though this is due to SSE2 vectorization; on my home machine it fails dismally, and even the horrible __movsd beats it...
Pitfalls: the memory was all aligned to powers of 2. The cache was (hopefully) flushed. rdtsc was used for timing.
Points of interest: MSVC has an __movsd intrinsic (unlisted in any reference). It outputs the same assembly code I'm using, but it fails dismally (even when inlined!). That's probably why it's unlisted.
VC9 memcpy can be forced to vectorize on my non-SSE2 machine; it will, however, corrupt the FPU stack, and it also seems to have a bug.
This is the full source to what I used to test (including my alterations; again, credit to http://www.mindcontrol.org/~hplus/ for the original). The binaries and project files are available on request.
In conclusion, it seems a switching variant might be the best, similar to the MSVC CRT one, only a lot more sturdy, with more options and a single once-off check (via inlined function pointers? or something more devious, like internal direct call patching). However, inlining would probably have to use a best-case method instead.
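A rough sketch of what such a once-off check could look like, using a function pointer that starts out pointing at a resolver and then patches itself on the first call. All the names here (`fast_copy`, `copy_scalar`, `copy_wide`, `copy_resolve`) are made up for illustration, and `copy_wide` is only a stand-in for a real SIMD routine. The CPU test uses the GCC/Clang builtin `__builtin_cpu_supports`; on MSVC one would query CPUID instead (not shown), so on other compilers this sketch simply falls back to the scalar path. It is also not hardened for threads: two threads racing through the resolver would both store the same pointer, which is benign in practice but not formally guaranteed.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Plain byte loop: works everywhere. */
static void *copy_scalar(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Placeholder for an SSE2 variant; delegates to memcpy in this sketch. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Feature test. Assumption: GCC or Clang on x86; otherwise report "no". */
static int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

static void *copy_resolve(void *dst, const void *src, size_t n);

/* Starts at the resolver; replaced by the chosen variant on first use. */
static copy_fn copy_impl = copy_resolve;

static void *copy_resolve(void *dst, const void *src, size_t n)
{
    copy_impl = cpu_has_sse2() ? copy_wide : copy_scalar;
    return copy_impl(dst, src, n);
}

/* Public entry point: one indirect call, no per-call CPU check. */
static void *fast_copy(void *dst, const void *src, size_t n)
{
    return copy_impl(dst, src, n);
}
```

The cost after the first call is a single indirect call, which is why inlined callers would still need a fixed best-case variant.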
Update 3
A question asked by Eshan reminded me of something useful and related to this, although only for bit sets and bit ops: BitMagic can be quite useful for large bit sets, and it even has a nice article on SSE2 (bit) optimization. Unfortunately, this still isn't a CRT/stdlib-esque library. It seems most of these projects are dedicated to a specific, small section (of problems).
This raises the question: would it be worthwhile to create an open-source, probably multi-platform, performance CRT/stdlib project, providing various versions of the standardised functions, each optimized for a certain situation, as well as a 'best-case'/general-use variant of each function, with either runtime branching for scalar/MMX/SSE/SSE2+ (à la MSVC) or a forced compile-time scalar/SIMD switch?
This could be useful for HPC, or for projects where every bit of performance counts (like games), freeing the programmer from worrying about the speed of the inbuilt functions and requiring only a small amount of tuning to find the optimal variant.
Update 4
I think the nature of this question should be expanded to include techniques that use SSE/MMX to optimize non-vector/matrix applications; these could probably be used for 32/64-bit scalar code as well. A good example is how to check for the occurrence of a byte in a given 32/64/128/256-bit data type all at once, using scalar techniques (bit manipulation), MMX and SSE/SIMD.
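For reference, a minimal sketch of the byte-occurrence check at three widths. The 32- and 64-bit versions use the well-known "has zero byte" bit trick after XOR-ing with the value broadcast to every byte; the 128-bit version uses SSE2 (`_mm_cmpeq_epi8` plus `_mm_movemask_epi8`), so it needs more than SSE1. The function names are made up for illustration, an x86 target is assumed, and the MMX and 256-bit variants are left out.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar, 32 bits: XOR turns every matching byte into 0x00, then the
   classic (x - 0x01..) & ~x & 0x80.. test reports whether any byte is zero. */
static int has_byte_u32(uint32_t word, uint8_t value)
{
    uint32_t x = word ^ (0x01010101u * value);
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}

/* Scalar, 64 bits: same trick with wider constants. */
static int has_byte_u64(uint64_t word, uint8_t value)
{
    uint64_t x = word ^ (0x0101010101010101ull * value);
    return ((x - 0x0101010101010101ull) & ~x & 0x8080808080808080ull) != 0;
}

/* SSE2, 128 bits: compare all 16 bytes at once and collapse the per-byte
   results into a 16-bit mask. p16 must point at 16 readable bytes. */
static int has_byte_m128(const void *p16, uint8_t value)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p16);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)value));
    return _mm_movemask_epi8(eq) != 0;
}
```

The scalar versions only answer yes/no; the SSE2 mask also gives the positions of the matches if they are needed.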
Also, I see a lot of answers along the lines of "just use ICC". That's a good answer, but it's not my kind of answer. Firstly, ICC is not something I can use continuously (unless Intel has a free student version for Windows), due to the 30-day trial. Secondly, and more pertinently, I'm not only after the libraries themselves, but also the techniques used to optimize/create the functions they contain, for my personal edification and improvement, so that I can apply such techniques and principles to my own code (where needed), in conjunction with the use of these libraries. Hopefully that clears up that part :)
Comments (8)
Here's an article on how to use SIMD instructions to vectorize the counting of characters:
http://porg.es/blog/ridiculous-utf-8-character-counting
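As a rough idea of how such counting can be vectorized (this is not the article's code, just a sketch under my own assumptions): in valid UTF-8 every code point has exactly one byte that is not a continuation byte (`10xxxxxx`), so it is enough to count the continuation bytes 16 at a time and subtract. The function name is hypothetical, SSE2 on an x86 target is assumed, and the input is assumed to be well-formed UTF-8.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: number of code points in len bytes of valid UTF-8. */
static size_t utf8_count(const char *s, size_t len)
{
    const __m128i top2 = _mm_set1_epi8((char)0xC0);  /* keep the top two bits */
    const __m128i cont = _mm_set1_epi8((char)0x80);  /* 10xxxxxx pattern      */
    size_t continuation = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i is_cont = _mm_cmpeq_epi8(_mm_and_si128(v, top2), cont);
        unsigned bits = (unsigned)_mm_movemask_epi8(is_cont);
        while (bits) {           /* portable popcount of the 16-bit mask */
            bits &= bits - 1;
            ++continuation;
        }
    }
    for (; i < len; ++i)         /* scalar tail */
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            ++continuation;

    return len - continuation;
}
```

A tuned version would accumulate per-byte counters in a vector register instead of popcounting every mask, but the idea is the same.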
For simple operations such as memset, memcpy, etc, where there is very little computation, there is little point in SIMD optimisation, since memory bandwidth will usually be the limiting factor.
Maybe libSIMDx86?
http://simdx86.sourceforge.net
You can use Apple's or OpenSolaris's libc. These libc implementations contain what you are looking for. I was looking for this kind of thing some 6 years back and had to write it painfully, the hard way.
Ages ago I remember following a coding contest called the 'fastcode' project. They did some awesome, ground-breaking optimisation for that time using Delphi. See their results page. Since it is written for Pascal's fast function call model (copying arguments to registers), converting to the C-style stdc function call model (pushing on the stack) may be a bit awkward. The project has had no updates for a long time; in particular, no code has been written for SSE4.2.
Here's a fast memcpy implementation in C that can replace the standard library version of memcpy if necessary:
http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
Honestly, what I would do is just install the Intel C++ Compiler and learn the various automated SIMD optimization flags available. We've had very good experience optimizing code performance by simply compiling it with ICC.
Keep in mind that the entire STL library is basically just header files, so the whole thing is compiled into your exe/lib/dll, and as such can be optimized however you like.
ICC has many options and lets you specify (at the simplest) which SSE levels to target. You can also use it to generate a binary file with multiple code paths, such that if the optimal SSE configuration you compiled against isn't available, it'll run a different set of (still optimized) code configured for a less capable SIMD CPU.
strstr is hard to optimize because (a) \0-termination means you have to read every byte anyway, and (b) it has to be good on all the edge cases, too.
With that said, you can beat standard strstr by a factor of 10, using SSE2 ops. I've noticed that gcc 4.4 uses these ops for strlen now, but not for the other string ops.
More on how to use SSE2 registers for strlen, strchr, strpbrk, etc. at mischasan.wordpress.com. Pardon my super-terse code layout.
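To show the basic building block without reproducing the blog's code, here is a small memchr-style sketch: compare 16 bytes against the needle with `_mm_cmpeq_epi8`, turn the result into a bit mask with `_mm_movemask_epi8`, and locate the first set bit. The name `sse2_memchr` is hypothetical and an x86 target with SSE2 is assumed. It takes an explicit length on purpose, which sidesteps the over-read problem of \0-terminated strings; a real strlen/strchr would typically use aligned loads so that a 16-byte read never crosses into an unmapped page.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: first occurrence of byte c in buf[0..len), or NULL. */
static const void *sse2_memchr(const void *buf, int c, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    const __m128i needle = _mm_set1_epi8((char)c);
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        unsigned bits = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if (bits) {
            /* bit k set means byte i + k matched; walk to the lowest set bit */
            while (!(bits & 1u)) {
                bits >>= 1;
                ++i;
            }
            return p + i;
        }
    }
    for (; i < len; ++i)         /* scalar tail, never reads past len */
        if (p[i] == (unsigned char)c)
            return p + i;
    return NULL;
}
```

With a compiler builtin for count-trailing-zeros, the inner while loop becomes a single instruction (BSF), which ties back to the BSR/BSF item in the question.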
I personally wouldn't bother trying to write super-optimized versions of libc functions that try to handle every possible scenario with good performance.
Instead, write optimized versions for specific situations, where you know enough about the problem at hand to write proper code... and where it matters. There's a semantic difference between memset and ClearLargeBufferCacheWriteThrough.
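To illustrate the kind of specialised routine meant here, a minimal sketch of a large-buffer clear that uses SSE2 non-temporal stores (`_mm_stream_si128`) so the zeros bypass the cache instead of evicting useful data. The function name and its preconditions are my own assumptions, not an existing API: the buffer must be 16-byte aligned and its length a multiple of 16, which is exactly the sort of knowledge a general-purpose memset cannot rely on.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: zero a large buffer with streaming stores.
   Returns 0 on success, -1 if buf is not 16-byte aligned or len is
   not a multiple of 16 (the caller is expected to guarantee both). */
static int clear_large_buffer_streaming(void *buf, size_t len)
{
    if (((uintptr_t)buf & 15u) != 0 || (len & 15u) != 0)
        return -1;

    const __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(p + i, zero);  /* write around the cache */

    _mm_sfence();  /* make the streamed stores globally visible */
    return 0;
}
```

For small buffers that are about to be read again, this is usually slower than an ordinary memset, which is the semantic difference in question.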