Floating-point multiplication executes slower depending on operands in C
I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the boundary values (usually 1000), while the rest of the elements are 0 for the Zero type and 1 for the NonZero type.
The code stores the matrix from the file in two allocated matrices of the same size. It then performs an operation on every element of one matrix using its own value and the values of its neighbours (4 adds and 1 multiply), and stores the result in the second matrix. Once the sweep is finished, the matrix pointers are swapped and the same operation is performed a fixed number of times. Here is the core code:
#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time = 0; cur_time < timeSteps; cur_time++) {
    for (i = 1; i < rows-1; i++) {
        for (j = 1; j < cols-1; j++) {
            /* 5-point stencil: average the cell with its 4 neighbours */
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Swap read/write pointers for the next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}
The case I am presenting uses a fixed number of 500 timeSteps (outer iterations) and a matrix size of 8192 rows by 8192 columns, but the problem persists when I change the number of timeSteps or the matrix size. Note that I only measure the time of this specific part of the algorithm, so neither reading the matrix from the file nor anything else affects the time measurement.
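The post does not show its timing code; assuming something like clock_gettime is used, the measurement presumably wraps only the loop nest, roughly like this sketch (link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;

    /* ... allocate and fill rMat/wMat, as in the post ... */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the stencil time-step loop shown above ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("stencil time: %.3f s\n", elapsed);
    return 0;
}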
What happens is that I get different times depending on which type of matrix I use, with much worse performance for the Zero type (every other matrix performs the same as the NonZero type; I have already tried generating a matrix full of random values).
I am certain it is the multiplication operation: if I remove it and leave only the adds, the two types perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".
This behaviour is certainly weird to me, as I thought that the speed of floating-point operations was independent of the values of the operands, which does not seem to be the case here. I have also tried to catch and report SIGFPE exceptions in case that was the problem, but I got nothing.
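For reference, here is a minimal sketch (my own, assuming glibc's feenableexcept extension) of how floating-point exceptions can be turned into a catchable SIGFPE; as the comment notes, it would not fire for this workload anyway:

#define _GNU_SOURCE        /* feenableexcept is a glibc extension */
#include <fenv.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void fpe_handler(int sig)
{
    fprintf(stderr, "caught SIGFPE (signal %d)\n", sig);
    exit(EXIT_FAILURE);
}

int main(void)
{
    signal(SIGFPE, fpe_handler);
    /* Unmask the usual error conditions. Producing denormals raises
       only the underflow/inexact flags, none of these, so no SIGFPE
       is delivered for this computation. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
    /* ... run the stencil here ... */
    return 0;
}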
In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.
2 Answers
The problem has already mostly been diagnosed, but I will write up exactly what happens here.
Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).
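As quantified in the next paragraph, the frontier goes denormal at time step 55; this quick sketch of mine (not part of the answer) reproduces the cutoff:

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Find the first time step t at which the frontier value 0.2^t
       falls below the smallest normal float, FLT_MIN = 2^-126. */
    int t = 0;
    float frontier = 1.0f;
    while (frontier >= FLT_MIN) {
        frontier *= 0.2f;
        t++;
    }
    printf("frontier goes denormal at t = %d (value %a)\n", t, frontier);
    /* Prints t = 55; the analogous double/DBL_MIN loop gives t = 441. */
    return 0;
}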
The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating-point data, which explains the slowdown.

When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.

Note that changing the data type to double would delay, but not alleviate, the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.

At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header of the C library.

Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
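One concrete way to set these MXCSR bits on x86 is via the SSE intrinsics; the following is a sketch of my own (the wrapper function name is mine), not part of the original answer:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (MXCSR FTZ bit) */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (MXCSR DAZ bit) */

/* Call once before the stencil loop. FTZ flushes denormal results to
   zero; DAZ additionally treats denormal inputs as zero. Both affect
   only SSE arithmetic. */
static void enable_flush_to_zero(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}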
Maybe your ZeroMatrix uses the typical storage scheme for sparse matrices: storing every non-zero value in a linked list. If that is the case, it is quite understandable why it performs worse than a typical array-based storage scheme: it needs to walk the linked list once for every operation you perform. In that case you can perhaps speed the process up by using a matrix-multiply algorithm that accounts for the matrix being sparse. If that is not the case, please post minimal but complete code so we can play with it.
Here is one of the possibilities for multiplying sparse matrices efficiently:
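For illustration, a minimal sketch of one common approach, compressed sparse row (CSR) storage with a sparse matrix-vector product; the identifiers are mine, not from the post:

/* Compressed sparse row (CSR): y = A*x touching only stored nonzeros.
   rowPtr[i] .. rowPtr[i+1] delimits the entries of row i; val and
   colIdx hold the nonzero values and their column indices. */
void csr_spmv(int rows,
              const float *val, const int *colIdx, const int *rowPtr,
              const float *x, float *y)
{
    int i, k;
    for (i = 0; i < rows; i++) {
        float sum = 0.0f;
        for (k = rowPtr[i]; k < rowPtr[i+1]; k++)
            sum += val[k] * x[colIdx[k]];
        y[i] = sum;
    }
}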