Limits on the workloads that can be offloaded to hardware accelerators

Published 2025-01-11 17:54:30


I am interested in understanding what percentage of workloads can almost never be put into a hardware accelerator. While more and more tasks are becoming amenable to domain-specific accelerators, I wonder whether there are tasks for which an accelerator will never be useful. Put simply, which tasks are least likely to be accelerator-compatible?

I would love a pointer to resources that speak to this question.


Comments (1)

执着的年纪 2025-01-18 17:54:30


So you have the following question(s) in your original post:


Question:

  • I wonder whether there are tasks for which an accelerator will never be useful. Put simply, which tasks are least likely to be accelerator-compatible?

Answer:

Of course it's possible. First and foremost, a workload that is to be accelerated on a hardware accelerator should not involve the following:

  • dynamic polymorphism and dynamic memory allocation
  • runtime type information (RTTI)
  • system calls
  • ........... (some more depending on the hardware accelerator)

Although explaining each of the above points would make this post too lengthy, I can explain a few. Dynamic memory allocation is not supported because hardware accelerators have a fixed set of resources on silicon; memory resources cannot be created and freed dynamically. Similarly, dynamic polymorphism is supported only if the pointed-to object can be determined at compile time. And there should be no system calls, because these are actions that ask the operating system to perform some task. OS operations, such as file reads/writes or OS queries like time and date, are therefore not supported.
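The fixed-resources point can be made concrete with a small sketch (my own illustration, not code from the original post; MAX_N is a hypothetical compile-time bound): an accelerator-friendly kernel sizes its buffers and bounds its loops at compile time, whereas a malloc/free-based variant of the same loop would not synthesize on a typical HLS flow.

```c
#include <stddef.h>

#define MAX_N 1024   /* hypothetical compile-time bound */

/* Accelerator-friendly version: the buffer size and the loop trip count
 * are fixed at compile time, so the kernel maps onto a fixed set of
 * silicon resources. */
int sum_fixed(const int data[MAX_N], size_t n) {
    int acc = 0;
    for (size_t i = 0; i < MAX_N; ++i) {  /* bounded trip count */
        if (i < n)
            acc += data[i];
    }
    return acc;
}
```

Note the guarded loop: rather than iterating a data-dependent number of times, the loop always runs MAX_N iterations and masks out the unused ones, which is the usual way to keep trip counts static for hardware.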

Having said that, the workloads that are least likely to be accelerator-compatible are mostly communication-intensive kernels. Such kernels often incur a serious data-transfer overhead compared to CPU execution, which can be detected by measuring CPU-FPGA or CPU-GPU communication time.

For better understanding, let's take the following example:

Communication Intensive Breadth-First Search (BFS):

    procedure BFS(G, root) is
        let Q be a queue
        label root as explored
        Q.enqueue(root)
        while Q is not empty do
            v := Q.dequeue()
            if v is the goal then
                return v
            for all edges from v to w in G.adjacentEdges(v) do
                if w is not labeled as explored then
                    label w as explored
                    Q.enqueue(w)

The above pseudocode is the famous breadth-first search (BFS). Why is it not a good candidate for acceleration? Because it traverses all the nodes in a graph without doing any significant computation; it is immensely communication-intensive rather than compute-intensive. Furthermore, for a data-driven algorithm like BFS, the shape and structure of the input can actually dictate runtime characteristics such as locality and branch behaviour, making it a poor candidate for hardware acceleration.
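To make "communication-intensive" concrete, here is a back-of-the-envelope arithmetic-intensity estimate (a rough model with illustrative numbers, not figures from the original answer): BFS performs only a couple of operations per edge while touching several bytes of graph data, whereas a dense kernel such as matrix multiply performs O(n) operations per element it loads.

```c
/* Rough arithmetic-intensity estimates (illustrative numbers only). */

/* CSR-style BFS: per edge, roughly one visited-check plus one enqueue,
 * against one 4-byte neighbour index and one 4-byte label access. */
double bfs_intensity(void) {
    double ops_per_edge   = 2.0;
    double bytes_per_edge = 8.0;
    return ops_per_edge / bytes_per_edge;   /* ~0.25 op/byte */
}

/* n x n dense matrix multiply: 2*n^3 flops over 3*n^2 4-byte elements,
 * so intensity grows linearly with n. */
double gemm_intensity(int n) {
    double flops = 2.0 * (double)n * n * n;
    double bytes = 3.0 * (double)n * n * 4.0;
    return flops / bytes;
}
```

Under this crude model BFS does a fraction of an operation per byte moved, while dense matrix multiply gets more useful work per byte the larger the matrices are, which is exactly the gap between communication-bound and compute-bound kernels.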

  • Now the question arises: why have I focused on compute-intensive versus communication-intensive?

As you have tagged FPGA in your post, I can explain this concept with respect to FPGAs. For instance, in a system that uses a PCIe connection between the CPU and the FPGA, we compute the PCIe transfer time as the elapsed time of data movement from host memory to device memory through PCIe-based direct memory access (DMA).

The PCIe transfer time is a significant factor in filtering out FPGA acceleration for communication-bound workloads. The BFS above can therefore show severe PCIe transfer overhead and hence is not acceleration-compatible.
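This filtering can be written down as a simple break-even model (my own sketch with hypothetical parameters, not a measured model): offloading only pays off when the CPU time saved exceeds the DMA transfer cost plus the accelerator's own compute time.

```c
/* Returns 1 if offloading wins under a simple model:
 *   t_offload = bytes / bandwidth + DMA latency + accelerator time
 * All parameters are hypothetical inputs (seconds, bytes, bytes/s). */
int offload_wins(double bytes, double bw_bytes_per_s,
                 double dma_latency_s, double t_accel_s, double t_cpu_s) {
    double t_offload = bytes / bw_bytes_per_s + dma_latency_s + t_accel_s;
    return t_offload < t_cpu_s;
}
```

Plugging in plausible numbers shows the pattern: moving 1 MB over an ~8 GB/s PCIe link to save 100 ms of CPU work wins easily, while moving 1 GB to save 50 ms loses on the transfer alone.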

On the other hand, consider the family of object-recognition algorithms implemented as deep neural networks. If you go through these algorithms, you will find that a significant amount of time (maybe more than 90%) is spent in the convolution function, the input data are relatively small, and the convolutions are embarrassingly parallel. This makes them an ideal workload for moving to a hardware accelerator.
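The "embarrassingly parallel" property is visible in a minimal 1-D convolution sketch (sizes here are my own illustrative choices): every output element depends only on the inputs, never on another output, so all outputs can be computed concurrently.

```c
#define IN_LEN  8
#define K_LEN   3
#define OUT_LEN (IN_LEN - K_LEN + 1)

/* Each out[i] is an independent dot product of a kernel-sized window of
 * the input with the kernel -- no cross-iteration dependence, so the
 * outer loop parallelizes trivially on an accelerator. */
void conv1d(const float in[IN_LEN], const float kern[K_LEN],
            float out[OUT_LEN]) {
    for (int i = 0; i < OUT_LEN; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < K_LEN; ++j)
            acc += in[i + j] * kern[j];
        out[i] = acc;
    }
}
```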

Let's take another example showing a perfect workload for hardware acceleration:

Compute Intensive General Matrix Multiply (GEMM):

/* Definitions the kernel relies on (representative values; the
 * original post did not include them). */
#define row_size 64              /* matrices are row_size x row_size */
#define block_size 8             /* blocking factor */
#define N (row_size * row_size)
typedef double TYPE;

void gemm(TYPE m1[N], TYPE m2[N], TYPE prod[N]){
    int i, k, j, jj, kk;
    int i_row, k_row;
    TYPE temp_x, mul;

    loopjj:for (jj = 0; jj < row_size; jj += block_size){
        loopkk:for (kk = 0; kk < row_size; kk += block_size){
            loopi:for (i = 0; i < row_size; ++i){
                loopk:for (k = 0; k < block_size; ++k){
                    i_row = i * row_size;
                    k_row = (k + kk) * row_size;
                    temp_x = m1[i_row + k + kk];
                    loopj:for (j = 0; j < block_size; ++j){
                        mul = temp_x * m2[k_row + j + jj];
                        prod[i_row + j + jj] += mul;
                    }
                }
            }
        }
    }
}

The above code example is General Matrix Multiply (GEMM), a common algorithm in linear algebra, machine learning, statistics, and many other domains. The matrix multiplication in this code is computed using a blocked loop structure: reordering the arithmetic to reuse all of the elements in one block before moving on to the next dramatically improves memory locality. Hence it is extremely compute-intensive and a perfect candidate for acceleration.
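To see that the blocking only reorders the arithmetic without changing the result, here is a self-contained miniature of the same jj/kk/i/k/j loop structure on a 4x4 case, checked against the naive triple loop (ROWS and BLK are illustrative choices, not values from the original code; both functions assume the output array starts zeroed):

```c
#define ROWS 4
#define BLK  2

/* Blocked multiply: same loop structure as the gemm kernel above. */
void gemm_blocked(const double a[ROWS*ROWS], const double b[ROWS*ROWS],
                  double c[ROWS*ROWS]) {
    for (int jj = 0; jj < ROWS; jj += BLK)
        for (int kk = 0; kk < ROWS; kk += BLK)
            for (int i = 0; i < ROWS; ++i)
                for (int k = 0; k < BLK; ++k) {
                    double t = a[i*ROWS + k + kk];
                    for (int j = 0; j < BLK; ++j)
                        c[i*ROWS + j + jj] += t * b[(k + kk)*ROWS + j + jj];
                }
}

/* Naive reference: c[i][j] += a[i][k] * b[k][j]. */
void gemm_naive(const double a[ROWS*ROWS], const double b[ROWS*ROWS],
                double c[ROWS*ROWS]) {
    for (int i = 0; i < ROWS; ++i)
        for (int k = 0; k < ROWS; ++k)
            for (int j = 0; j < ROWS; ++j)
                c[i*ROWS + j] += a[i*ROWS + k] * b[k*ROWS + j];
}
```

Both versions perform exactly the same multiply-accumulate operations; the blocked one just visits them in an order that keeps a BLK-sized tile of each matrix hot in fast memory.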

Hence, to name only a few, we can conclude that the deciding factors for hardware acceleration are:

  • the compute load of your workload,
  • the data your workload accesses,
  • how parallel your workload is,
  • the underlying silicon available for acceleration,
  • the bandwidth and latency of the communication channels.

Do not forget Amdahl's Law:

Even if you have found the right workload, one that is an ideal candidate for hardware acceleration, the struggle does not end there. Why? Because the famous Amdahl's law comes into play. You might be able to speed up a workload significantly, but if it accounts for only 2% of the application's runtime, then even if you speed it up infinitely (taking its runtime to 0), you will speed up the overall application by only about 2% at the system level. Hence, your ideal workload should not only be ideal algorithmically; it should also contribute significantly to the overall runtime of your system.
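The 2% figure above can be checked directly with Amdahl's law, S = 1 / ((1 - p) + p/s), where p is the accelerated fraction of the runtime and s is the speedup of that fraction:

```c
/* Amdahl's law: overall speedup when a fraction p of the runtime
 * is accelerated by a factor s. */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.02, even an enormous s gives at most 1/0.98, roughly 1.02, i.e. only about a 2% system-level gain, exactly as the paragraph above argues.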
