计算矩阵每个元素的指数的最有效方法
我正在从 Matlab 迁移到 C + GSL,我想知道计算矩阵 B 的最有效方法是什么:
B[i][j] = exp(A[i][j])
其中 i 在 [0, Ny] 中,j 在 [0, Ny] 中NX]。
请注意,这与矩阵指数不同:
B = exp(A)
可以使用 GSL (linalg.h) 中的一些不稳定/不受支持的代码来完成。
我刚刚找到了蛮力解决方案(几个“for”循环),但是有没有更聪明的方法来做到这一点?
编辑
德鲁霍尔解决方案帖子的结果
所有结果都来自1024x1024 for(for)
循环,其中每次迭代都会分配两个 double
值(复数)。 时间是100次执行的平均时间。
- 考虑到 {Row,Column}-Major 模式来存储矩阵时的结果:
- 在 Row-Major 模式下循环内循环中的行时需要 226.56 毫秒(情况 1)。
- 在 Row-Major 模式下循环遍历内循环中的列时为 223.22 毫秒(情况 2)。
- 使用 GSL 提供的
gsl_matrix_complex_set
函数时为 224.60 毫秒(情况 3)。
案例 1 的源代码:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
}
}
案例 2 的源代码:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
}
}
案例 3 的源代码:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
gsl_matrix_complex_set(matrix, i, j, c_value);
}
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
没有办法避免迭代所有元素并对每个元素调用
exp()
或等效方法。但迭代的方式有更快和更慢的。特别是,您的目标应该是最大限度地减少缓存未命中。查明您的数据是否按行优先顺序或列优先顺序存储,并确保安排循环,以便内部循环迭代内存中连续存储的元素,而外部循环则大步前进到下一行(如果行主要)或列(如果列主要)。虽然这看起来微不足道,但它可以在性能上产生巨大的差异(取决于矩阵的大小)。
处理完缓存后,您的下一个目标是消除循环开销。第一步(如果您的矩阵 API 支持)是从嵌套循环(M 和 N 范围)转变为迭代基础数据(MN 范围)的单个循环。为此,您需要获取指向底层内存块的原始指针(即 double 而不是 double**)。
最后,加入一些循环展开(即,为循环的每次迭代执行 8 或 16 个元素)以进一步减少循环开销,这可能是您能做到的最快速度。您可能需要一个带有fall-through的最终switch语句来清理剩余元素(当您的数组大小%块大小!= 0时)。
There's no way to avoid iterating over all the elements and calling
exp()
or equivalent on each one. But there are faster and slower ways to iterate.In particular, your goal should be to mimimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (MN bound). You'll need to get a raw pointer to the underlying memory block (that is, a double rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
不,除非有一些我没听说过的奇怪的数学怪癖,否则您几乎只需要使用两个 for 循环来遍历元素即可。
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
如果您只想将
exp
应用于数字数组,那么确实没有捷径。你必须调用它 (Nx * Ny) 次。如果某些矩阵元素很简单,例如 0,或者有重复的元素,一些记忆可能会有所帮助。但是,如果您真正想要的是矩阵指数(这非常有用),我们依赖的算法是 DGPADM。它是 Fortran 语言,但您可以使用 f2c 将其转换为 C。< a href="http://www.maths.uq.edu.au/expokit/paper.pdf" rel="nofollow noreferrer">这是关于它的论文。
If you just want to apply
exp
to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
由于循环的内容尚未显示,因此计算 c_value 的位我们不知道代码的性能是否受内存带宽限制或受 CPU 限制。唯一确定的方法是使用分析器,而且是一个复杂的分析器。它需要能够测量内存延迟,即 CPU 空闲等待数据从 RAM 到达的时间量。
如果您受到内存带宽的限制,那么一旦顺序访问内存,您就无能为力了。当数据按顺序获取时,CPU 和内存工作得最好。随机访问会影响吞吐量,因为数据更有可能从 RAM 提取到缓存中。您始终可以尝试获得更快的 RAM。
如果您受到 CPU 的限制,那么还有更多选项可供您选择。使用 SIMD 是一种选择,手动编码浮点代码也是一种选择(由于多种原因,C/C++ 编译器不擅长 FPU 代码)。如果是我,并且内部循环中的代码允许这样做,我将有两个指向数组的指针,一个位于数组的开头,另一个位于数组的 4/5 处。每次迭代,将使用第一个指针执行 SIMD 操作,并使用第二个指针执行标量 FPU 操作,以便循环的每次迭代执行五个值。然后,我将 SIMD 指令与 FPU 指令交错以减少延迟成本。这不会影响您的缓存,因为(至少在 Pentium 上)MMU 可以同时传输最多四个数据流(即在没有任何提示或特殊指令的情况下为您预取数据)。
Since the contents of the loop haven't been shown, the bit that calculates the c_value we don't know if the performance of the code is limited by memory bandwidth or limited by CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compiler aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).