After reading the accepted answer in When should we use prefetch? and the examples from Prefetching Examples?, I still have trouble understanding when to actually use prefetch. While those answers provide examples where prefetch is helpful, they do not explain how to discover the opportunity in real programs. It looks like random guessing.
In particular, I am interested in the prefetch instructions for Intel x86 (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible through GCC's __builtin_prefetch intrinsic. I would like to know:
- How can I see that software prefetch can help my specific program? I imagine that I can collect CPU profiling metrics (e.g. the number of cache misses) with Intel VTune or the Linux perf utility. In that case, which metrics (or relations between them) indicate an opportunity to improve performance with software prefetching?
- How can I locate the loads that suffer the most from cache misses?
- How can I see the cache level where the misses happen, in order to decide which prefetch hint (0, 1, 2) to use?
- Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch? As an example, assume that the following loop suffers from cache misses:
for (int i = 0; i < n; i++) {
// some code
double x = a[i];
// some code
}
Should I place the prefetch before or after the load of a[i]? How far ahead should it point, i.e. what m should I use in a[i+m]? Do I need to worry about unrolling the loop to make sure that I prefetch only on cache line boundaries, or will the prefetch be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list of events with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. Having the number of cache misses is not very useful unless you also get the number of loads/stores; there are generic counters for those too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2 there are no generic counters (at least on Intel processors) and you need to use specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. A good documentation can be found here.
When the proportion of misses is big, you should try to optimize the code, but that is only a hint. In fact, your application may have a lot of cache hits overall and yet many cache misses in a critical part/phase, so those misses get lost among all the others. This is especially true for L1 cache references, which are massive in scalar code compared to SIMD code. One solution is to profile only a specific portion of your application and to use your knowledge of it so as to investigate in the right direction. Performance counters are not really a tool to automatically search for issues in your program, but a tool to assist you in validating or disproving hypotheses, or to give hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.
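For example, a quick way to get miss ratios for the L1 and last-level caches (assuming these generic events exist on your machine; ./your_program is a placeholder) is:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./your_program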
Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful, since you can tell which instructions are responsible for most of the cache misses (though it is not always exact in practice). You can use perf record + perf report to get this information (see the previously linked tutorial for more details).
Note that there are many possible causes of cache misses, and only a few cases can be solved by software prefetching.
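A minimal session looks like this (whether a :p/:pp precise modifier is needed or supported depends on your CPU and kernel):
perf record -e L1-dcache-load-misses:pp ./your_program
perf report
perf report then attributes the sampled misses back to functions and instructions.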
This is often difficult to choose in practice, and it is very dependent on your application. Theoretically, the number is a hint telling the processor the level of locality of the target cache line (e.g. whether to fetch it into the L1, L2 or L3 cache). For example, if you know that the data will be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is heavily used and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into lower cache levels. In practice it is a bit more complex, since the behavior may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
An example of usage is in this question: software prefetching was used to avoid a cache-thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.
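For reference, here is a small sketch of how the hint maps onto GCC's __builtin_prefetch(addr, rw, locality). The x86 instructions named in the comments are the usual mapping, but the exact instruction emitted is up to the compiler and the target:
#include <stddef.h>

// Sketch: the third argument of __builtin_prefetch selects the locality hint.
void prefetch_hints_demo(const double *a, size_t i, size_t m) {
    __builtin_prefetch(&a[i + m], 0, 3); // read, keep in all levels (typically prefetcht0)
    __builtin_prefetch(&a[i + m], 0, 2); // read, fetch into L2 and up (typically prefetcht1)
    __builtin_prefetch(&a[i + m], 0, 1); // read, fetch into L3 only (typically prefetcht2)
    __builtin_prefetch(&a[i + m], 0, 0); // read once, non-temporal (prefetchnta)
    __builtin_prefetch(&a[i + m], 1, 3); // prefetch for write (prefetchw if enabled)
}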
This is clearly the trickiest part. You should prefetch the cache lines sufficiently early for the latency to be significantly reduced, otherwise the instruction is useless and can actually be detrimental: it takes some space in the program, needs to be decoded, and uses load ports that could execute other (more critical) load instructions. On the other hand, if a line is prefetched too early, it can be evicted again before it is used and will need to be reloaded...
The usual solution is to write code along these lines (a sketch based on your loop; the distance constant is purely illustrative):
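for (int i = 0; i < n; i++) {
    // some code
    __builtin_prefetch(&a[i + magic_distance_guess]);
    double x = a[i];
    // some code
}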
where magic_distance_guess is a value generally set based on benchmarks (or on a very deep understanding of the target platform, though practice often shows that even highly-skilled developers fail to find the best value).
The thing is, the latency is very dependent on where the data is coming from and on the target platform. In most cases, developers cannot really know exactly when to do the prefetching unless they work on a single given target platform. This makes software prefetching tricky to use, and often detrimental when the target platform changes (one also has to consider the maintainability of the code and the overhead of the instructions). Not to mention that the built-ins are compiler-dependent, the prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.
Yes, prefetch instructions are not free, so it is better to use only one instruction per cache line (any additional prefetch of the same cache line is useless).
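A hedged sketch of what that means for the loop above (reusing a, n and magic_distance_guess), assuming 64-byte cache lines, i.e. 8 doubles per line, so one prefetch covers 8 iterations; magic_distance_guess should then be a multiple of 8:
// One prefetch per 64-byte cache line: unroll by 8 doubles per line.
int i = 0;
for (; i + 8 <= n; i += 8) {
    __builtin_prefetch(&a[i + magic_distance_guess]);
    for (int j = 0; j < 8; j++) {
        double x = a[i + j];
        // some code using x
    }
}
for (; i < n; i++) { /* remainder elements, no prefetch needed */ }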
This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they analyze a pretty huge window of instructions. They tend to execute loads as soon as possible so as to avoid misses, and they are often very good at that job.
In your example loop, I expect the hardware prefetchers to do a very good job, and software prefetching should actually be slower on a (relatively recent) mainstream processor.
Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but nowadays they tend to be very good. Additionally, it is often better to guide the hardware prefetchers than to use software prefetch instructions, since the former have a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and by most developers) unless you really know what you are doing.
The quick answer is: don't.
As you correctly analyzed, prefetching is a tricky and advanced optimization technique that is not portable and rarely useful.
You can use profiling to determine which sections of code form a bottleneck, and use specialized tools such as valgrind to try to identify cache misses that could potentially be avoided using the compiler builtins.
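For instance, valgrind's cachegrind tool simulates the cache hierarchy and attributes misses to source lines (the binary name is a placeholder; <pid> is filled in by valgrind):
valgrind --tool=cachegrind ./your_program
cg_annotate cachegrind.out.<pid>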
Don't expect too much from this, but do profile the code to concentrate your optimization efforts where they can be useful.
Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.
As mentioned in other answers, managing SW prefetching requires a significant amount of manual effort and is difficult to generalize across different systems and workloads. HW prefetchers on modern CPUs have made substantial progress and can recognize many different memory access patterns.
Although it is somewhat old, [1] from 2012 extensively discusses HW and SW prefetching, including your questions. The authors claim that SW prefetching is suitable for scenarios such as short arrays, and continuous or irregular reads.
Interestingly, there are still many patterns in modern systems that HW prefetchers cannot recognize well, such as pointer chasing. And if the accesses are batched (multi-get) or there is some computational latency in the task, you can use SW prefetching to hide memory access latency. For example, [2, 3] present a relatively general design that uses coroutines to overlap computation with SW prefetching, hiding the data-read latency.
Especially when accessing memory that is slower or less bandwidth-rich than DRAM, the advantages of SW prefetching increase further. Additionally, HW prefetchers may even hurt the performance of components other than the caches through incorrectly prefetched data.
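To make the pointer-chasing case concrete, here is a hedged sketch (my own, not from the cited papers) of the common interleaving idea: walk several independent linked lists in lockstep, so the prefetch for one list's next node is in flight while the others are being processed:
#include <stddef.h>

struct node { struct node *next; long value; };

// Sum k independent lists, interleaving the walks so that each
// prefetch overlaps with the work done on the other lists.
long sum_lists(struct node *heads[], int k) {
    long total = 0;
    int live = 1;
    while (live) {
        live = 0;
        for (int j = 0; j < k; j++) {
            struct node *n = heads[j];
            if (!n) continue;
            __builtin_prefetch(n->next);  // start fetching the next node early
            total += n->value;            // work on the current node meanwhile
            heads[j] = n->next;
            live = 1;
        }
    }
    return total;
}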
GCC has a -fprefetch-loop-arrays option (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fprefetch-loop-arrays), but I wouldn't recommend it for general use, only as an experiment while microbenchmarking a specific loop.
Some -mtune=whatever options may enable it, for CPUs where it's known that the HW prefetchers can use the help (and where front-end bandwidth is high enough to handle the extra throughput cost of running prefetch instructions without usually slowing things down much, especially if data is already hot in L2 or L1d cache).
There are some --param tuning knobs, like --param prefetch-minimum-stride=n, that could limit it to only generating prefetch instructions when the pointer increment spans multiple cache lines, or something like that. (Hardware prefetchers in modern x86 CPUs can handle strided access patterns, although HW prefetchers typically don't work across 4K boundaries, since contiguous virtual pages might not map to contiguous physical pages. Out-of-order exec can generate demand loads into the next page, which is often good enough.)
See also How much of 'What Every Programmer Should Know About Memory' is still valid? - SW prefetch is often not worth it.
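A minimal way to experiment with this (the file name is a placeholder; whether the pass actually fires depends on the GCC version, the -mtune target, and the loop shape):
gcc -O3 -fprefetch-loop-arrays -mtune=native -o loop loop.c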