Difference in CPU time between two similar lines
There is a while loop in my program, where IterZNext and IterZ are pointers to nodes in a list. The nodes in the list are of a struct type with a field called "Index".
double xx = 20.0;
double yy = 10000.0;
double zz;
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
        NextIndex1 = IterZNext->Index; // line (*)
        NextIndex = IterZNext->Index;  // line (**)
        IterZNext->Index;
    }
}
When I profiled my program, I found that line (*)
NextIndex1 = IterZNext->Index;
consumes most of the CPU time (2.193s), while line (**)
NextIndex = IterZNext->Index;
which is almost identical to line (*), uses only 0.093s. I used Intel VTune Amplifier to look at the assembly of these two lines, which is as follows:
Address Line Assembly CPU Time Instructions Retired
Line (*):
0x1666 561 mov eax, dword ptr [ebp-0x44] 0.015s 50,000,000
0x1669 561 mov ecx, dword ptr [eax+0x8]
0x166c 561 mov dword ptr [ebp-0x68], ecx 2.178s 1,614,000,000
Line (**):
0x166f 562 mov byte ptr [ebp-0x155], 0x1 0.039s 80,000,000
0x1676 562 mov eax, dword ptr [ebp-0x44] 0.027s 44,000,000
0x1679 562 mov ecx, dword ptr [eax+0x8]
0x167c 562 mov dword ptr [ebp-0x5c], ecx 0.026s 94,000,000
If I change the order of line (*) and line (**), the program becomes
double xx = 20.0;
double yy = 10000.0;
double zz;
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
    IterZ=IterZNext;
    IterZNext = IterZ->Next;
    if (IterZNext!=NULL)
    {
        zz = xx + yy;
        NextIndex = IterZNext->Index;  // line (**)
        NextIndex1 = IterZNext->Index; // line (*)
        IterZNext->Index;
    }
}
and the assembly results change to
Address Line Assembly CPU Time Instructions Retired
Line (**):
0x1666 560 mov byte ptr [ebp-0x155], 0x1 0.044s 84,000,000
0x166d 560 mov eax, dword ptr [ebp-0x44] 0.006s 2,000,000
0x1670 560 mov ecx, dword ptr [eax+0x8] 0.001s 4,000,000
0x1673 560 mov dword ptr [ebp-0x5c], ecx 1.193s 1,536,000,000
Line (*):
0x1676 561 mov eax, dword ptr [ebp-0x44] 0.052s 128,000,000
0x1679 561 mov ecx, dword ptr [eax+0x8]
0x167c 561 mov dword ptr [ebp-0x68], ecx 0.034s 112,000,000
In this case, line (**) uses most of the CPU time (1.245s) while line (*) uses only 0.086s.
Could someone tell me:
(1) Why does the first assignment take so long? Note that the line zz = xx + yy uses only 0.058s. Is this related to cache misses, since all the nodes in the list are dynamically generated?
(2) Why is there such a huge difference in CPU time between these two lines?
Thanks!
3 Answers
All modern CPUs are superscalar and out-of-order, which means that instructions are not actually executed in the order of the assembly, and there isn't really such a thing as the current PC: there are many tens of instructions in flight and executing at once.
Therefore any sampling information a CPU reports is just a rough area the CPU was executing. It was executing the instruction it indicated when the sampling interrupt went off, but it was also executing all the other in-flight ones!
However, people have got used to (and expect) profiling tools to tell them exactly which individual instruction the CPU is currently running, so when the sampling interrupt triggers, the CPU essentially picks one of the many active instructions to be the 'current' one.
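To make this concrete, here is a minimal pointer-chasing sketch (the node layout and all names are illustrative, not taken from the question's program). The stall is caused by the first touch of each node's cache line, but a sampling profiler will typically report the cost on a nearby retired instruction, much as the 2.178s shows up on the final mov in the question's listing:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical node type mirroring the question's list: a Next
   pointer and an Index field. Layout and sizes are assumptions. */
struct Node {
    struct Node *Next;
    int Index;
};

int main(void)
{
    enum { N = 1 << 20 };
    struct Node *nodes = malloc((size_t)N * sizeof *nodes);
    size_t *order = malloc((size_t)N * sizeof *order);
    if (!nodes || !order) return 1;

    /* Link the nodes in shuffled order so that following Next jumps
       around in memory, defeating the hardware prefetcher. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < N; i++)
        nodes[order[i]].Next = &nodes[order[i + 1]];
    nodes[order[N - 1]].Next = NULL;
    for (size_t i = 0; i < N; i++) nodes[i].Index = (int)i;

    long sum = 0;
    for (struct Node *p = &nodes[order[0]]; p != NULL; p = p->Next) {
        /* Reading p->Index is the first touch of each node's cache
           line during the walk, so this is where the miss actually
           stalls; a sampling profiler will usually charge the time
           to whichever nearby instruction retires when the sample
           fires, e.g. the store of the loaded value. */
        sum += p->Index;
    }
    printf("%ld\n", sum);
    free(order);
    free(nodes);
    return 0;
}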
CPU cache-line behavior is probably the reason: accessing [ebp-0x5c] also brings [ebp-0x68] into the cache, which will then be fetched much faster (in the second case; vice versa in the first).
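This is easy to sanity-check: 0x68 - 0x5c is only 12 bytes, so the two stack slots very likely sit on the same line. A minimal sketch, assuming 64-byte cache lines (typical for x86, but worth verifying on your CPU), with locals standing in for the two slots from the disassembly:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Stand-ins for the stack slots [ebp-0x5c] and [ebp-0x68];
       the compiler chooses the actual placement. */
    int NextIndex = 0, NextIndex1 = 0;

    const uintptr_t line = 64;   /* assumed cache-line size */
    int same = ((uintptr_t)&NextIndex / line)
            == ((uintptr_t)&NextIndex1 / line);
    printf("same cache line: %s\n", same ? "yes" : "no");
    return 0;
}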
It is definitely due to the cache miss: the further down the memory hierarchy a miss goes, the bigger the performance penalty the processor pays. Modern CPUs run much faster than memory. While today's processors can be clocked at around 4GHz, memory still runs at roughly 0.3GHz; that is a huge performance gap, and it continues to grow. Caches were introduced to hide this gap. Without them, a modern processor would spend a huge amount of time waiting for data from memory and doing nothing in the meantime.

Besides the raw performance gap, each memory access incurs additional latencies: possible contention on the memory bus with other CPUs and DMA devices; the time needed to process and route the request in the processor's memory-management logic (checking the caches at all levels; virtual-to-physical address translation, which can itself involve a TLB miss with an additional memory access; pushing the request onto the memory bus; and so on); and the memory controller (routing the request from the CPU controller to the controller's memory bus, possibly waiting for a memory bank refresh cycle to complete, and so on). To summarize, a raw access to memory has a really big cost in comparison to an L1 cache hit or a register access, a difference comparable to the gap between accessing data in memory and in secondary storage (HDD).
Furthermore, the cost of a memory access grows as you move away from the processor: an L2 access incurs a bigger penalty than an L1 or register access, an L3 access a bigger penalty than L2, and finally a main-memory access a bigger penalty than L3. For example, you can compare the cost of data access at different levels of the memory hierarchy in the table below (captured from http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/6):
[Figure: Cache/Memory Latency Comparison]
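You can reproduce numbers like these on your own machine. A rough sketch, assuming POSIX clock_gettime, which times a dependent pointer chase over growing working sets; the ns/access figure steps up as the set outgrows L1, L2, L3 and finally spills to DRAM:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4) {
        size_t n = kb * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm builds a random single-cycle
           permutation, so idx = next[idx] visits every slot and
           each load depends on the previous one (no prefetching,
           no short cycles). */
        for (size_t i = 0; i < n; i++) next[i] = i;
        srand(1);
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        const long iters = 20 * 1000 * 1000;
        size_t idx = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long k = 0; k < iters; k++) idx = next[idx];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        /* Print idx so the chase cannot be optimized away. */
        printf("%6zu KiB: %5.1f ns/access (idx=%zu)\n",
               kb, ns / iters, idx);
        free(next);
    }
    return 0;
}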
In regard to your particular case:

You pay the penalty when dereferencing the Index value in that code line. Note that this is the first access to the data of each successive node of your list; up to that moment you have manipulated only the node's address, not the data at that address, so there has been no memory access for it.

You stated that you use a dynamically allocated list, which is bad from a cache-hit-rate point of view. In addition, I suppose your list is big enough that the cache gets flooded with previously accessed data (the list nodes visited on earlier iterations), so on each new iteration the first access to Index will almost always be a cache miss, or at best a hit only in the L3 cache. But note that the data returned from memory on that first, missing access gets stored in the L1 cache, so when you access Index the second time within the same loop iteration, you get a cheap L1 cache hit!
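One common mitigation, sketched below with illustrative names (this is not the question's code): allocate the nodes from one contiguous pool so that nodes adjacent in the list are adjacent in memory. With 16-byte nodes (on a typical 64-bit build) and 64-byte lines, four list-adjacent nodes share a cache line, so most traversal accesses hit in L1:

#include <stdlib.h>

/* Node as described in the question: a struct with an Index field. */
struct Node {
    struct Node *Next;
    int Index;
};

/* Illustrative arena: carve nodes out of one contiguous block
   instead of calling malloc per node. */
struct NodeArena {
    struct Node *block;
    size_t used, cap;
};

static int arena_init(struct NodeArena *a, size_t cap)
{
    a->block = malloc(cap * sizeof *a->block);
    a->used = 0;
    a->cap = cap;
    return a->block != NULL;
}

static struct Node *arena_new_node(struct NodeArena *a, int index)
{
    if (a->used == a->cap) return NULL;   /* pool exhausted */
    struct Node *n = &a->block[a->used++];
    n->Next = NULL;
    n->Index = index;
    return n;
}

int main(void)
{
    struct NodeArena arena;
    if (!arena_init(&arena, 1000)) return 1;
    struct Node *head = arena_new_node(&arena, 0), *tail = head;
    for (int i = 1; i < 1000; i++) {
        tail->Next = arena_new_node(&arena, i);
        tail = tail->Next;
    }
    /* ... traverse as in the question's while loop ... */
    free(arena.block);
    return 0;
}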
So I hope this provides a detailed answer to both of your questions.
In regard to the correctness of the VTune report: I want to stand up for the Intel VTune developers. Of course, modern processors are very complex devices with a number of ILP-improving technologies on board, including pipelining, superscalar execution, out-of-order execution, branch prediction and so on, and of course this makes detailed instruction-level performance analysis harder and less precise. But tools like VTune are developed with those processor features in mind, and I believe their developers are not so careless as to build and ship a tool or feature that makes no sense. Furthermore, it looks like the developers at Intel, like no one else, have access to a full understanding of all the processor-feature details, and like no one else they can take those details into account while designing and developing a profiler.