VC++ SSE intrinsic optimization weirdness
I am performing a scattered read of 8-bit data from a file (de-interleaving a 64-channel wave file). I am then combining them into a single stream of bytes. The problem I'm having is with my reconstruction of the data to write out.
Basically I read in 16 bytes, build them into a single __m128i variable, and then use _mm_stream_ps to write the value back out to memory. However, I'm seeing some odd performance results.
In my first scheme I use the _mm_set_epi8 intrinsic to set my __m128i as follows:
const __m128i packedSamples = _mm_set_epi8( sample15, sample14, sample13, sample12, sample11, sample10, sample9, sample8,
sample7, sample6, sample5, sample4, sample3, sample2, sample1, sample0 );
Basically I leave it all up to the compiler to decide how to optimise it to give the best performance. This gives the WORST performance: my test runs in ~0.195 seconds.
Second, I tried merging down by using four _mm_set_epi32 calls and then packing them down:
const __m128i samples0 = _mm_set_epi32( sample3, sample2, sample1, sample0 );
const __m128i samples1 = _mm_set_epi32( sample7, sample6, sample5, sample4 );
const __m128i samples2 = _mm_set_epi32( sample11, sample10, sample9, sample8 );
const __m128i samples3 = _mm_set_epi32( sample15, sample14, sample13, sample12 );
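// Saturating packs fold the four vectors down: 32-bit lanes to 16-bit, then 16-bit to 8-bit.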
const __m128i packedSamples0 = _mm_packs_epi32( samples0, samples1 );
const __m128i packedSamples1 = _mm_packs_epi32( samples2, samples3 );
const __m128i packedSamples = _mm_packus_epi16( packedSamples0, packedSamples1 );
This does improve performance somewhat; my test now runs in ~0.15 seconds. It seems counter-intuitive that performance would improve by doing this, as I assumed this is exactly what _mm_set_epi8 is doing anyway ...
My final attempt was to use a bit of code I had from making FourCCs the old-fashioned way (with shifts and ors), then putting them into an __m128i using a single _mm_set_epi32.
const GCui32 samples0 = MakeFourCC( sample0, sample1, sample2, sample3 );
const GCui32 samples1 = MakeFourCC( sample4, sample5, sample6, sample7 );
const GCui32 samples2 = MakeFourCC( sample8, sample9, sample10, sample11 );
const GCui32 samples3 = MakeFourCC( sample12, sample13, sample14, sample15 );
const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
This gives even BETTER performance, taking ~0.135 seconds to run my test. I'm really starting to get confused.
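For reference, MakeFourCC (and Build32 further down) is just a plain shift-and-or byte packer. Its exact definition isn't shown here; a rough sketch, assuming sample0 lands in the low byte:

// Rough sketch -- not the exact definition, just the shape of it:
inline GCui32 MakeFourCC( GCui8 b0, GCui8 b1, GCui8 b2, GCui8 b3 )
{
    return (GCui32)b0 | ( (GCui32)b1 << 8 ) | ( (GCui32)b2 << 16 ) | ( (GCui32)b3 << 24 );
}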
So I tried a simple read-byte/write-byte system, and that is ever-so-slightly faster than even the last method. In shape it's just this (a sketch: pOut and numGroups are stand-in names, not the real code):
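// Sketch of the plain read-byte/write-byte loop, using the same channelStep
// offsets as the SSE versions. pOut and numGroups are stand-ins.
for ( GCui32 g = 0; g < numGroups; ++g )
{
    pOut[ 0 ] = *(pSamples + channelStep0);
    pOut[ 1 ] = *(pSamples + channelStep1);
    pOut[ 2 ] = *(pSamples + channelStep2);
    pOut[ 3 ] = *(pSamples + channelStep3);
    pOut += 4;
    pSamples += channelStep4;
}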
So what is going on? This all seems counter-intuitive to me.
I've considered the idea that the delays are occurring on the _mm_stream_ps because I'm supplying data too quickly, but then I would expect to get exactly the same results whatever I do. Is it possible that the first two methods mean that the 16 loads can't be distributed through the loop to hide latency? If so, why is this? Surely an intrinsic allows the compiler to make optimisations as and where it pleases ... I thought that was the whole point ... Also, surely performing 16 reads and 16 writes would be much slower than 16 reads and 1 write with a bunch of SSE juggling instructions ... After all, it's the reads and writes that are the slow bit!
Any ideas about what's going on would be much appreciated! :D
Edit: Further to the comment below, I stopped pre-loading the bytes as constants and changed it to this:
const __m128i samples0 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples1 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples2 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples3 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i packedSamples0 = _mm_packs_epi32( samples0, samples1 );
const __m128i packedSamples1 = _mm_packs_epi32( samples2, samples3 );
const __m128i packedSamples = _mm_packus_epi16( packedSamples0, packedSamples1 );
and this improved performance to ~0.143 seconds. Still not as good as the straight C implementation ...
Edit again: The best performance I'm getting thus far is:
// Load the samples.
const GCui8 sample0 = *(pSamples + channelStep0);
const GCui8 sample1 = *(pSamples + channelStep1);
const GCui8 sample2 = *(pSamples + channelStep2);
const GCui8 sample3 = *(pSamples + channelStep3);
const GCui32 samples0 = Build32( sample0, sample1, sample2, sample3 );
pSamples += channelStep4;
const GCui8 sample4 = *(pSamples + channelStep0);
const GCui8 sample5 = *(pSamples + channelStep1);
const GCui8 sample6 = *(pSamples + channelStep2);
const GCui8 sample7 = *(pSamples + channelStep3);
const GCui32 samples1 = Build32( sample4, sample5, sample6, sample7 );
pSamples += channelStep4;
// Load the samples.
const GCui8 sample8 = *(pSamples + channelStep0);
const GCui8 sample9 = *(pSamples + channelStep1);
const GCui8 sample10 = *(pSamples + channelStep2);
const GCui8 sample11 = *(pSamples + channelStep3);
const GCui32 samples2 = Build32( sample8, sample9, sample10, sample11 );
pSamples += channelStep4;
const GCui8 sample12 = *(pSamples + channelStep0);
const GCui8 sample13 = *(pSamples + channelStep1);
const GCui8 sample14 = *(pSamples + channelStep2);
const GCui8 sample15 = *(pSamples + channelStep3);
const GCui32 samples3 = Build32( sample12, sample13, sample14, sample15 );
pSamples += channelStep4;
const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
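// Non-temporal store: write all 16 packed samples out in one go, bypassing the cache.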
_mm_stream_ps( pWrite + 0, *(__m128*)&packedSamples );
This gives me processing in ~0.095 seconds, which is considerably better. I don't appear to be able to get close to that with SSE though ... I'm still confused by that, but ... ho hum.
Comments (3)
Perhaps the compiler is trying to put all the arguments to the intrinsic into registers at once. You don't want to access that many variables at once without organizing them.
Rather than declare a separate identifier for each sample, try putting them into a char[16] (sketched at the end of this answer). The compiler will promote the 16 values to registers as it sees fit, as long as you don't take the address of anything within the array. You can add an __aligned__ tag (or whatever VC++ uses) and maybe avoid the intrinsic altogether. Otherwise, calling the intrinsic with ( sample[15], sample[14], sample[13] … sample[0] ) should make the compiler's job easier, or at least do no harm.
Edit: I'm pretty sure you're fighting a register spill, but that suggestion would probably just store the bytes individually, which isn't what you want. I think my advice is to interleave your final attempt (using MakeFourCC) with the read operations, to make sure it's scheduled correctly and with no round-trips to the stack. Of course, inspecting the object code is the best way to ensure that.
Essentially, you are streaming data into the register file and then streaming it back out. You don't want to overload it before it's time to flush the data.
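One possible reading of the char[16] suggestion, as a sketch only: __declspec(align(16)) is the VC++ spelling of the alignment tag, and the final _mm_load_si128 is an assumption about how the vector would then be formed.

// Sketch: gather into one array rather than 16 named locals, then do a single
// aligned vector load. Reuses pSamples/channelStep* from the question.
__declspec(align(16)) char samples[ 16 ];
for ( int i = 0; i < 4; ++i )
{
    samples[ 4 * i + 0 ] = *(pSamples + channelStep0);
    samples[ 4 * i + 1 ] = *(pSamples + channelStep1);
    samples[ 4 * i + 2 ] = *(pSamples + channelStep2);
    samples[ 4 * i + 3 ] = *(pSamples + channelStep3);
    pSamples += channelStep4;
}
const __m128i packedSamples = _mm_load_si128( (const __m128i*)samples );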
VS is notoriously bad at optimizing intrinsics, especially at moving data to and from SSE registers. The intrinsics themselves are used pretty well, however ...
What you see is that it is trying to fill the SSE register with that 16-argument _mm_set_epi8 monster. The _mm_set_epi32 / pack variant from your second attempt works much better and (should) easily be faster.
I built my own test-bed:
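Roughly along these lines (a sketch of the shape of such a test-bed; the buffer sizes, iteration count, and QueryPerformanceCounter timing are stand-ins, not the original code):

#include <emmintrin.h>
#include <windows.h>
#include <stdio.h>

int main()
{
    static unsigned char src[ 1 << 20 ];                        // dummy input
    static __declspec(align(16)) unsigned char dst[ 1 << 20 ];  // stream target

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency( &freq );

    // Test 1: one 16-argument _mm_set_epi8 per vector.
    QueryPerformanceCounter( &t0 );
    for ( int i = 0; i < (1 << 20); i += 16 )
    {
        const unsigned char* p = src + i;
        const __m128i v = _mm_set_epi8( p[15], p[14], p[13], p[12], p[11], p[10], p[9], p[8],
                                        p[7], p[6], p[5], p[4], p[3], p[2], p[1], p[0] );
        _mm_stream_si128( (__m128i*)( dst + i ), v );
    }
    QueryPerformanceCounter( &t1 );
    printf( "set_epi8: %f s\n", (double)( t1.QuadPart - t0.QuadPart ) / freq.QuadPart );

    // Test 2: four _mm_set_epi32 plus two saturating packs per vector.
    QueryPerformanceCounter( &t0 );
    for ( int i = 0; i < (1 << 20); i += 16 )
    {
        const unsigned char* p = src + i;
        const __m128i s0 = _mm_set_epi32( p[3],  p[2],  p[1],  p[0]  );
        const __m128i s1 = _mm_set_epi32( p[7],  p[6],  p[5],  p[4]  );
        const __m128i s2 = _mm_set_epi32( p[11], p[10], p[9],  p[8]  );
        const __m128i s3 = _mm_set_epi32( p[15], p[14], p[13], p[12] );
        const __m128i v  = _mm_packus_epi16( _mm_packs_epi32( s0, s1 ),
                                             _mm_packs_epi32( s2, s3 ) );
        _mm_stream_si128( (__m128i*)( dst + i ), v );
    }
    QueryPerformanceCounter( &t1 );
    printf( "packs:    %f s\n", (double)( t1.QuadPart - t0.QuadPart ) / freq.QuadPart );
    return 0;
}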
For me, test 2 is faster than test 1.
Am I doing something wrong? Is this not the code you are using? What am I missing? Is it just me?
Using intrinsics breaks compiler optimisations!
The whole point of the intrinsic functions is to insert opcodes the compiler doesn't know about into the stream of opcodes the compiler does know about and has generated. Unless the compiler is given some metadata about the opcode and how it affects registers and memory, the compiler can't assume that any data is preserved after executing the intrinsic. This really hurts the optimising part of the compiler: it can't reorder instructions around the intrinsic, it can't assume registers are unaffected, and so on.
I think the best way to optimise this is to look at the bigger picture - you need to consider the whole process from reading the source data to writing the final output. Micro optimisations rarely give big results, unless you're doing something really badly to start with.
Perhaps, if you detail the required input and output, someone here could suggest an optimal method to handle it.