Approximate cost to access various caches and main memory?

如梦 2024-10-07 03:01:19

Numbers everyone should know

           0.5 ns - CPU L1 dCACHE reference
           1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
           5   ns - CPU L1 iCACHE Branch mispredict
           7   ns - CPU L2  CACHE reference
          71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
         100   ns - MUTEX lock/unlock
         100   ns - own DDR MEMORY reference
         135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
         202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
         325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
      10,000   ns - Compress 1K bytes with Zippy PROCESS
      20,000   ns - Send 2K bytes over 1 Gbps NETWORK
     250,000   ns - Read 1 MB sequentially from MEMORY
     500,000   ns - Round trip within a same DataCenter
  10,000,000   ns - DISK seek
  10,000,000   ns - Read 1 MB sequentially from NETWORK
  30,000,000   ns - Read 1 MB sequentially from DISK
 150,000,000   ns - Send a NETWORK packet CA -> Netherlands
|   |   |   |
|   |   | ns|
|   | us|
| ms|

From:
Originally by Peter Norvig:
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/,
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine


小梨窩很甜 2024-10-07 03:01:19

Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress that it has what you need and more (for example, see page 22 for some timings and cycle counts).

Additionally, this page has some details on clock cycles etc. The second link gives the following numbers:

Core i7 Xeon 5500 Series Data Source Latency (approximate)               [Pg. 22]

local  L1 CACHE hit,                              ~4 cycles (   2.1 -  1.2 ns )
local  L2 CACHE hit,                             ~10 cycles (   5.3 -  3.0 ns )
local  L3 CACHE hit, line unshared               ~40 cycles (  21.4 - 12.0 ns )
local  L3 CACHE hit, shared line in another core ~65 cycles (  34.8 - 19.5 ns )
local  L3 CACHE hit, modified in another core    ~75 cycles (  40.2 - 22.5 ns )

remote L3 CACHE (Ref: Fig.1 [Pg. 5])        ~100-300 cycles ( 160.7 - 30.0 ns )

local  DRAM                                                   ~60 ns
remote DRAM                                                  ~100 ns

EDIT2:

Most important is the notice under the cited table, which says:

"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS, NUMBERS OF DIMMS, ETC, ETC. YOUR MILEAGE MAY VARY."

EDIT: I should highlight that, besides the timing/cycle information, the Intel document above covers many more (extremely) useful details of the i7 and Xeon range of processors (from a performance point of view).
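The cycle counts and the bracketed nanosecond ranges in the Xeon 5500 table are related by the core clock: time = cycles / frequency. The sketch below is only an illustration of that arithmetic. The two clock values (about 1.87 GHz and 3.33 GHz) are an assumption inferred from the table itself (for instance 300 cycles / 160.7 ns), not figures quoted from the Intel guide, and the function name is made up for this example.

```python
# Minimal sketch: convert a latency given in core clock cycles to nanoseconds.
# ASSUMPTION: the clock range below is inferred from the table's own numbers,
# it is not taken from the Intel document.
SLOW_CLOCK_GHZ = 1.8667  # assumed slowest clock behind the larger ns values
FAST_CLOCK_GHZ = 3.3333  # assumed fastest clock behind the smaller ns values


def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """One cycle at f GHz lasts 1/f ns, so n cycles last n/f ns."""
    return cycles / clock_ghz


if __name__ == "__main__":
    rows = [
        ("local L1 CACHE hit", 4),
        ("local L2 CACHE hit", 10),
        ("local L3 CACHE hit, line unshared", 40),
        ("local L3 CACHE hit, shared line in another core", 65),
        ("local L3 CACHE hit, modified in another core", 75),
    ]
    for name, cycles in rows:
        slow = cycles_to_ns(cycles, SLOW_CLOCK_GHZ)
        fast = cycles_to_ns(cycles, FAST_CLOCK_GHZ)
        print(f"{name:<50} ~{cycles:>3} cycles ( {slow:5.1f} - {fast:5.1f} ns )")
```

Under these assumed clocks most of the bracketed values in the table come out to within rounding, which is a reminder that the nanosecond figures are only as good as the clock they were computed for.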


幸福％小乖 2024-10-07 03:01:19

Cost to access various memories in a pretty page

See this page presenting the memory latency decrease from 1990 to 2020.

Summary

Values have decreased but have been stable since 2005

        1 ns        L1 cache
        3 ns        Branch mispredict
        4 ns        L2 cache
       17 ns        Mutex lock/unlock
      100 ns        Main memory (RAM)
    2 000 ns (2µs)  1KB Zippy-compress

Still some improvements, prediction for 2020

   16 000 ns (16µs) SSD random read (olibre's note: should be less)
  500 000 ns (½ms)  Round trip in datacenter
2 000 000 ns (2ms)  HDD random read (seek)

See also other sources

What Every Programmer Should Know About Memory by Ulrich Drepper (2007)
Old, but still an excellent in-depth explanation of how memory hardware and software interact.
- Full PDF (114 pages)
  - LWN comments on the PDF version
  - Another one
- 7 posts on LWN + comments
The post The Infinite Space Between Words on codinghorror.com, based on the book Systems Performance: Enterprise and the Cloud
See http://www.7-cpu.com/ for L1/L2/L3/RAM/... latencies (e.g. Haswell i7-4770 has L1=1ns, L2=3ns, L3=10ns, RAM=67ns, BranchMisprediction=4ns)
http://idarkside.org/posts/numbers-you-should-know/

See also

For further understanding, I recommend the excellent presentation of modern cache architectures (June 2014) by Gerhard Wellein, Hannes Hofmann and Dietmar Fey at the University of Erlangen-Nürnberg.

French speakers may appreciate an article by SpaceFox comparing a processor with a developer, both waiting for the information they need to continue working.

For the sake of a 2020 review of the predictions for 2025:

Over roughly the last 44 years of integrated circuit technology, classical (non-quantum) processors have evolved, literally and physically, "Per Aspera ad Astra". The last decade has shown that the classical process has come close to some hurdles that have no achievable physical path forward.

Number of logical cores can and may grow, yet not more than O(n^2~3)
Frequency [MHz] has already hit a physics-based ceiling that is hard, if not impossible, to circumvent
Transistor Count can and may grow, yet less than O(n^2~3) ( power, noise, "clock" )
Power [W] can grow, yet problems with power distribution & heat dissipation will increase
Single Thread Perf may grow, with direct benefits from large cache footprints and faster, wider memory I/O, and indirect benefits from less frequent system-forced context switching, as we can have more cores to split other threads/processes among

( Credits go to Leonardo Suriano & Karl Rupp )

    2022: Still some improvements, prediction for 2025+
--------------------------------------------------------------------------------
                 0.001 ns light transfer in Gemmatimonas phototrophica bacteriae
      |   |   |   |   |
      |   |   |   | ps|
      |   |   | ns|
      |   | us|        reminding us what Richard FEYNMAN told us:
      | ms|                             "There's a plenty of space
     s|                                                      down there"

-----s.-ms.-us.-ns|----------------------------------------------------------
                 0.1 ns - NOP
                 0.3 ns - XOR, ADD, SUB
                 0.5 ns - CPU L1 dCACHE reference           (1st introduced in late 80-ies )
                 0.9 ns - JMP SHORT
                 1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
    ?~~~~~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
               3~4   ns - CPU L2  CACHE reference           (2020/Q1)
                 5   ns - CPU L1 iCACHE Branch mispredict
                 7   ns - CPU L2  CACHE reference
                10   ns - DIV
                19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
                71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
               100   ns - MUTEX lock/unlock
               100   ns - own DDR MEMORY reference
               135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
               202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
               325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
    |Q>~~~~~ 5,000   ns - QPU on-chip QUBO ( quantum annealer minimiser 1 Qop )
            10,000   ns - Compress 1K bytes with a Zippy PROCESS
            20,000   ns - Send     2K bytes over 1 Gbps  NETWORK
           250,000   ns - Read   1 MB sequentially from  MEMORY
           500,000   ns - Round trip within a same DataCenter
    ?~~~ 2,500,000   ns - Read  10 MB sequentially from  MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
        10,000,000   ns - DISK seek
        10,000,000   ns - Read   1 MB sequentially from  NETWORK
    ?~~ 25,000,000   ns - Read 100 MB sequentially from  MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
        30,000,000   ns - Read 1 MB sequentially from a  DISK
    ?~~ 36,000,000   ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
       150,000,000   ns - Send a NETWORK packet CA -> Netherlands
    1s:   |   |   |
      .   |   | ns|
      .   | us|
      . ms|
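The Python-related rows in the table above lend themselves to a quick back-of-the-envelope estimate. The sketch below is one possible reading of those annotations, not a measurement: it assumes the "x ( 1 + nProcesses )" factor applies to the ~25 ms copy of a 100 MB process image, and that each spawned process pays the "x ( 2 )" SER/DES factor on one 10 MB pickled payload. The function and parameter names are hypothetical, and IPC or network transport costs are left out.

```python
# Rough estimate of add-on costs when spawning worker processes,
# using the table's figures as inputs (all values in nanoseconds).
# ASSUMPTION: one 10 MB payload per spawned process; transport costs ignored.
READ_100_MB_FROM_MEMORY_NS = 25_000_000  # "somewhat light python process to copy on spawn"
PICKLE_10_MB_NS = 36_000_000             # "Pickle.dump() SER a 10 MB object"


def spawn_overhead_ns(n_processes: int,
                      copy_ns: int = READ_100_MB_FROM_MEMORY_NS,
                      ser_ns: int = PICKLE_10_MB_NS) -> int:
    """Estimated overhead of spawning n_processes workers, in ns."""
    copy_cost = copy_ns * (1 + n_processes)   # x ( 1 + nProcesses )
    serdes_cost = 2 * ser_ns * n_processes    # x ( 2 ) for SER + DES, per process
    return copy_cost + serdes_cost


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} processes: ~{spawn_overhead_ns(n) / 1e6:.0f} ms of overhead")
```

With these inputs, four workers already cost a few hundred milliseconds before any useful work starts, which is the point the annotated rows are making.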

For the sake of a 2015 review of the predictions for 2020:

Still some improvements, prediction for 2020 (Ref. olibre's answer below)

            16 000 ns ( 16 µs) SSD random read (olibre's note: should be less)
           500 000 ns (  ½ ms) Round trip in datacenter
         2 000 000 ns (  2 ms) HDD random read (seek)
    1s:   |   |   |
      .   |   | ns|
      .   | us|
      . ms|

Currently available in 2015:
============================
               820 ns ( 0.8µs) random read from a SSD-DataPlane
             1 200 ns ( 1.2µs) Round trip in datacenter
             1 200 ns ( 1.2µs) random read from a HDD-DataPlane
    1s:   |   |   |
      .   |   | ns|
      .   | us|
      . ms|

For the sake of comparing the CPU and GPU latency landscapes:

It is not an easy task to compare even the simplest CPU / cache / DRAM lineups ( even in a uniform memory access model ), where DRAM speed is one factor in determining latency and loaded latency ( a saturated system ) is another. The latter dominates, and it is what enterprise applications will experience far more often than an idle, fully unloaded system.

                    +----------------------------------- 5,6,7,8,9,..12,15,16 
                    |                               +--- 1066,1333,..2800..3300
                    v                               v
First  word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate  
Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
                                        ^----------------------- 7x .. difference
******************************** 
So:
===

resulting DDR3-side latencies are between _____________
                                          3.03 ns    ^
                                                     |
                                         36.58 ns ___v_ based on DDR3 HW facts
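The formula above can be checked numerically. The sketch below assumes the CAS latency is given in memory clock cycles and the data rate in MT/s (two transfers per clock, hence the factor of 2); the function name is made up for this example. Taking the smallest CAS latency with the fastest data rate for the first word, and the largest CAS latency with the slowest data rate for the eighth word, gives the two bounds quoted above. Those pairings are not necessarily real DIMM configurations, they only mark the edges of the listed ranges.

```python
def ddr_word_latency_ns(cas_latency: int, data_rate_mt_s: float, word_index: int) -> float:
    """Latency in ns until the word_index-th word (1-based) of a burst arrives.

    ( ( CAS latency * 2 ) + ( N - 1 ) ) / Data Rate, with the data rate in MT/s,
    yields microseconds; multiply by 1000 to get nanoseconds.
    """
    transfers = cas_latency * 2 + (word_index - 1)
    return transfers / data_rate_mt_s * 1000.0


if __name__ == "__main__":
    best = ddr_word_latency_ns(cas_latency=5, data_rate_mt_s=3300, word_index=1)
    worst = ddr_word_latency_ns(cas_latency=16, data_rate_mt_s=1066, word_index=8)
    print(f"best  case (first  word, CL5,  3300 MT/s): {best:6.2f} ns")
    print(f"worst case (eighth word, CL16, 1066 MT/s): {worst:6.2f} ns")
```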

GPU engines have received a lot of technical marketing, while deep internal dependencies are the key to understanding both the real strengths and the real weaknesses these architectures experience in practice ( typically much different from the expectations whipped up by aggressive marketing ).

   1 ns _________ LET'S SET UP A TIME/DISTANCE SCALE FIRST:
          °      ^
          |\     |a 1 ft-distance a photon travels in vacuum ( less in dark-fibre )
          | \    |
          |  \   |
        __|___\__v____________________________________________________
          |    |
          |<-->|  a 1 ns TimeDOMAIN "distance", before a photon arrived
          |    |
          ^    v 
    DATA  |    |DATA
    RQST'd|    |RECV'd ( DATA XFER/FETCH latency )

  25 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor REGISTER access
  35 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor    L1-onHit-[--8kB]CACHE

  70 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor SHARED-MEM access

 230 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
 320 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE

 350 ns
 700 ns @ 1147 MHz FERMI:  GPU Streaming Multiprocessor GLOBAL-MEM access
 - - - - -

Understanding the internals is thus much more important here than in other fields, where architectures are published and numerous benchmarks are freely available. Many thanks to the GPU micro-testers, who have spent their time and creativity to reveal the real working schemes inside GPU devices tested with a black-box approach.

    +====================| + 11-12 [usec] XFER-LATENCY-up   HostToDevice    ~~~ same as Intel X48 / nForce 790i

    |


空名 2024-10-07 03:01:19

|||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
|


奢欲 2024-10-07 03:01:19

|||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
|


ぽ尐不点ル 2024-10-07 03:01:19

|||||||| ~ 5.2 GB/sec XFER-BW-down @8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|
| Host-side
| cudaHostRegister( void *ptr, size_t size, unsigned int flags )
|                                                          |
|                                                          +-- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed
|                                                          +-- cudaHostRegisterMapped   -- maps memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); )
| ___HostAllocWriteCombined_MEM / cudaFreeHost()
| ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
|


七分※倦醒 2024-10-07 03:01:19


铜锣湾横着走 2024-10-07 03:01:19

||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~ 8 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~16 GB/s over 16-Lanes ( mode 16x )
|
| + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
| ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:]
|
| ~810 [ns] + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing with less than 810 ns latency any-to-any
|
|
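The PCIe figures in this part of the diagram follow from the per-lane signalling rate, the line encoding and the lane count. The sketch below is a rough illustration of that arithmetic, assuming 5 GT/s with 8b/10b encoding for PCIe 2.0 and 8 GT/s with 128b/130b encoding for PCIe 3.0; protocol overhead (packet headers, flow control) is ignored and the function name is hypothetical.

```python
# Per-lane signalling rate (GT/s) and encoding efficiency (payload bits / line bits).
PCIE_GENERATIONS = {
    "2.0": {"gt_per_s": 5.0, "encoding": 8 / 10},
    "3.0": {"gt_per_s": 8.0, "encoding": 128 / 130},
}


def pcie_bandwidth_gb_per_s(generation: str, lanes: int,
                            full_duplex: bool = False, raw: bool = False) -> float:
    """Approximate link bandwidth in GB/s (1 GB = 1e9 bytes).

    raw=True skips the encoding overhead; full_duplex=True sums both directions.
    """
    spec = PCIE_GENERATIONS[generation]
    bits_per_s = spec["gt_per_s"] * 1e9 * lanes
    if not raw:
        bits_per_s *= spec["encoding"]
    if full_duplex:
        bits_per_s *= 2
    return bits_per_s / 8 / 1e9


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        one_way = pcie_bandwidth_gb_per_s("2.0", lanes)
        both = pcie_bandwidth_gb_per_s("2.0", lanes, full_duplex=True)
        print(f"PCIe-2.0 x{lanes:<2}: ~{one_way:.0f} GB/s per direction, ~{both:.0f} GB/s both ways")
    raw48 = pcie_bandwidth_gb_per_s("3.0", 48, full_duplex=True, raw=True)
    print(f"PCIe-3.0 x48 raw full duplex: ~{raw48:.0f} GB/s")
```

The raw full-duplex figure for 48 Gen 3 lanes matches the 96 GBytes/sec in the quoted text; after 128b/130b encoding the usable figure is slightly lower.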


不爱素颜 2024-10-07 03:01:19


傲鸠 2024-10-07 03:01:19

||||||||
+====================|
|.pci............HOST|

My apologies for the "bigger picture", but latency masking also has cardinal limits imposed by on-chip smREG/L1/L2 capacities and hit/miss rates.

    |.pci............GPU.|

    |                    | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES                                                                  PAR --


云醉月微眠 2024-10-07 03:01:19

舂唻埖巳落 2024-10-07 03:01:19

つ低調成傷 2024-10-07 03:01:19

十秒萌定你 2024-10-07 03:01:19

王权女流氓 2024-10-07 03:01:19

丶视觉 2024-10-07 03:01:19

黑寡妇 2024-10-07 03:01:19

||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] @1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] @ 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] @1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR --


清眉祭 2024-10-07 03:01:19

||||||||~~~~~~|
|
| [ left-hand column of the diagram, flattened in extraction: DEVICE:3 .. DEVICE:0 PERSISTENT; +220 [GPU-CLKs]; L2-on-re-use-only +80 [GPU-CLKs], 64 KB; L1-on-re-use-only +40 [GPU-CLKs]; L1-on-re-use-only +8 [GPU-CLKs]; on-chip smREG +22 [GPU-CLKs]; MAX REGISTERs per Thread: CC-2.x 63, CC-1.x 128, CC-3.5 255; tBlock ]
| ________________ | / smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
with about 8.0 TB/s BW [C:Pg.46]
1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
0.1 TB/s BW gloMEM___
___________________________________________________________________________________________________________________________________________________________________________________________
gloMEM___
_________________________________________________________________________________________________________________________________________________________________________________________
gloMEM___
_______________________________________________________________________________________________________________________________________________________________________________________
gloMEM___
_____________________________________________________________________________________________________________________________________________________________________________________
gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
|\ + |
texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
|\ \ |\ + |\ |
texL2cache_| \ \ .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \ 256_KB|
| \ \ | \ + |\ ^ \ |
| \ \ | \ + | \ ^ \ |
| \ \ | \ + | \ ^ \ |
texL1cache_| \ \ .| \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \_ _ _ _ _^ \ 5_KB|
| \ \ | \ + ^\ ^ \ ^\ \ |
shaMEM + conL3cache_| \ \ | \ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \ ^ \ ^ \ \ 32_KB|
| \ \ | \ ^\ + ^ \ ^ \ ^ \ \ |
| \ \ | \ ^ \ + ^ \ ^ \ ^ \ \ |
______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
|_ _ _ ___|\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
L2_|_ _ _ __|\\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
8 KB L1_|_ _ _ _|\\\ \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
2 KB L1_|__________|\\\\__________\_\__________________________________\________\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
|t[0_______^:~~~~~~~~~~~~~~~~\:________]
|t[1_______^ :________]
|t[2_______^ :________]
|t[3_______^ :________]
|t[4_______^ :________]
|t[5_______^ :________]
|t[6_______^ :________]
|t[7_______^ 1stHalf-WARP :________]______________
|t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
|t[ 9_______^ :________]
|t[ A_______^ :________]
|t[ B_______^ :________]
|t[ C_______^ :________]
|t[ D_______^ :________]
|t[ E_______^ :________]
W0..|t[ F_______^____________WARP__:________]_____________
..............
............|t[0_______^:~~~~~~~~~~~~~~~\:________]
............|t[1_______^ :________]
............|t[2_______^ :________]
............|t[3_______^ :________]
............|t[4_______^ :________]
............|t[5_______^ :________]
............|t[6_______^ :________]
............|t[7_______^ 1stHalf-WARP :________]______________
............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
............|t[ 9_______^ :________]
............|t[ A_______^ :________]
............|t[ B_______^ :________]
............|t[ C_______^ :________]
............|t[ D_______^ :________]
............|t[ E_______^ :________]
W1..............|t[ F_______^___________WARP__:________]_____________
....................................................
...................................................|t[0_______^:~~~~~~~~~~~~~~~\:________]
...................................................|t[1_______^ :________]
...................................................|t[2_______^ :________]
...................................................|t[3_______^ :________]
...................................................|t[4_______^ :________]
...................................................|t[5_______^ :________]
...................................................|t[6_______^ :________]
...................................................|t[7_______^ 1stHalf-WARP :________]______________
...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
...................................................|t[ 9_______^ :________]
...................................................|t[ A_______^ :________]
...................................................|t[ B_______^ :________]
...................................................|t[ C_______^ :________]
...................................................|t[ D_______^ :________]
...................................................|t[ E_______^ :________]
Wn....................................................|t[ F_______^___________WARP__:________]_____________
°°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
\ CC-2.0

| BANDWIDTHs: ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE ( -Xptxas -v || nvcc -maxrregcount, with or without spillover(s) )
|   smREGs___  about 8.0 TB/s BW [C:Pg.46]
|   shaMEM___  1.3 TB/s BW = 4 B * 32 banks * 15 SMs * half of 1.4 GHz ( only on FERMI )
|   gloMEM___  0.1 TB/s BW
|
| DEVICE:0 .. DEVICE:3, each with its own PERSISTENT gloMEM___
|
| MEMORY LEVEL                   SIZE        ADDED LATENCY
| gloMEM___                      GB range    +440 [GPU_CLKs]
| texMEM___                      MB range
| texL2cache                     256 KB      +370 [GPU_CLKs]
| texL1cache                     5 KB        +260 [GPU_CLKs]
| shaMEM + conL3cache            32 KB       +220 [GPU_CLKs]
| L2 ( on re-use only )          64 KB       + 80 [GPU_CLKs]
| L1 ( on re-use only )          8 KB        + 40 [GPU_CLKs]
| conL1cache ( on re-use only )  2 KB        +  8 [GPU_CLKs]
| smREG ( on-chip )                          + 22 [GPU_CLKs]
|
|   +220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
|   + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
|
| MAX REGISTERs per Thread:  CC 1.x: 128   CC 2.x: 63   CC 3.5: 255
|
| tBlock: warps W0, W1, .. Wn; each warp drawn as threads t[0] .. t[F], with t[0] .. t[7] marked as the 1stHalf-WARP
| ( CC-2.0 )


||||
| / \ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / \ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| \ | //
| \ RR-mode //
| \ GREEDY-mode //
| \________________//
| \______________/SM:0__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:1__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:2__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:3__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:4__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:5__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:6__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:7__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:8__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:9__________________________________________________________________________________
| ..|SM:A |t[ F_______^___________WARP__:________]_______
| ..|SM:B |t[ F_______^___________WARP__:________]_______
| ..|SM:C |t[ F_______^___________WARP__:________]_______
| ..|SM:D |t[ F_______^___________WARP__:________]_______
| |_______________________________________________________________________________________
*/
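The shared-memory bandwidth figure quoted in the diagram (4 B * 32 banks * 15 SMs * half of 1.4 GHz) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the conversion from GPU clocks to nanoseconds assumes a 1.4 GHz clock, which the diagram does not state explicitly for its `[GPU_CLKs]` figures.

```python
def shared_mem_bandwidth(bytes_per_bank=4, banks=32, sms=15,
                         clock_hz=1.4e9, clock_fraction=0.5):
    """Aggregate shared-memory bandwidth in bytes per second.

    Defaults are the Fermi figures quoted in the diagram; each bank is
    assumed to deliver one word per (half-rate) clock.
    """
    return bytes_per_bank * banks * sms * clock_hz * clock_fraction


def clocks_to_ns(clocks, clock_hz):
    """Convert a latency in clock cycles to nanoseconds for a given clock."""
    return clocks / clock_hz * 1e9


bw = shared_mem_bandwidth()
print(f"shaMEM bandwidth: {bw / 1e12:.2f} TB/s")  # 1.34 TB/s, i.e. the quoted ~1.3 TB/s

# ASSUMPTION: a 1.4 GHz clock, used here only to give a feel for the scale.
print(f"+440 GPU_CLKs is roughly {clocks_to_ns(440, 1.4e9):.0f} ns at 1.4 GHz")
```

At that assumed clock, a global-memory access lands in the same few-hundred-nanosecond range as the cross-NUMA figures in the CPU list.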


The bottom line?

Any design motivated by low latency first has to reverse-engineer the "I/O-hydraulics" ( as 0 1-XFERs are incompressible by nature ). The resulting latencies then rule the performance envelope of any GPGPU solution, be it computationally intensive ( read: where processing costs are large enough to forgive somewhat poor XFER latencies ) or not ( read: where, perhaps to someone's surprise, CPUs are faster in end-to-end processing than GPU fabrics [citations available] ).
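A rough way to see this trade-off is to model end-to-end time with the transfers included. The sketch below is a toy model, not a measurement: the `Offload` class, its field names, and all of the numbers (10 µs per transfer, 10 GB/s link, 20x kernel speedup) are hypothetical values chosen only to illustrate the shape of the argument.

```python
from dataclasses import dataclass


@dataclass
class Offload:
    xfer_latency_s: float      # fixed cost of one host<->device transfer
    xfer_bandwidth_bps: float  # link bandwidth in bytes per second
    gpu_speedup: float         # how much faster the kernel runs than the CPU code


def gpu_end_to_end_s(cpu_compute_s, bytes_in, bytes_out, o):
    """End-to-end time when offloading: two transfers plus the kernel."""
    xfer = 2 * o.xfer_latency_s + (bytes_in + bytes_out) / o.xfer_bandwidth_bps
    return xfer + cpu_compute_s / o.gpu_speedup


# HYPOTHETICAL link and speedup figures, for illustration only.
link = Offload(xfer_latency_s=10e-6, xfer_bandwidth_bps=10e9, gpu_speedup=20.0)
MB = 1_000_000

for cpu_s in (20e-6, 1.0):
    gpu_s = gpu_end_to_end_s(cpu_s, MB, MB, link)
    winner = "GPU" if gpu_s < cpu_s else "CPU"
    print(f"CPU-only {cpu_s:.6f} s, offloaded {gpu_s:.6f} s, faster: {winner}")
```

With these made-up numbers, the 20 µs job is dominated by transfer cost and stays faster on the CPU, while the 1 s job amortises the transfers and benefits from the offload.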


沉溺在你眼里的海 2024-10-07 03:01:19

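Have a look at this "staircase" plot, which illustrates the different access times (in clock ticks) very well. Notice that the red CPU has an additional "step", probably because it has an L4 cache (while the others don't).

[Figure: Graphs of access times with different memory hierarchies]

Taken from this Extremetech article.

In computer science this is called "I/O complexity".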
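The staircase shape follows from a simple rule: once the working set no longer fits in a cache level, accesses fall through to the next, slower level. The sketch below is an idealised model of that rule. The latencies are the L1, L2, and DRAM figures from the list at the top of this thread; the cache capacities (32 KiB and 256 KiB) are assumptions chosen for illustration, and real curves are smoother around each boundary.

```python
# (name, capacity in bytes, latency in ns)
# Latencies come from the "numbers everyone should know" list;
# capacities are ASSUMED values for illustration only.
LEVELS = [
    ("L1", 32 * 1024, 0.5),
    ("L2", 256 * 1024, 7.0),
    ("DRAM", float("inf"), 100.0),
]


def access_latency(working_set_bytes):
    """Return (level, latency_ns) of the smallest level that holds the working set."""
    for name, capacity, latency_ns in LEVELS:
        if working_set_bytes <= capacity:
            return name, latency_ns
    raise ValueError("no level can hold the working set")


for kib in (4, 64, 4096):
    level, ns = access_latency(kib * 1024)
    print(f"{kib:>5} KiB working set -> {level:<4} ~{ns} ns per access")
```

Plotting `access_latency` over a range of working-set sizes gives one step per level, which is why a CPU with an extra cache level shows an extra step.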
