计算矩阵每个元素的指数的最有效方法

发布于 2024-09-11 08:33:11 字数 1634 浏览 6 评论 0 原文

我正在从 Matlab 迁移到 C + GSL,我想知道计算矩阵 B 的最有效方法是什么:

B[i][j] = exp(A[i][j])

其中 i 在 [0, Ny] 中,j 在 [0, Ny] 中NX]。

请注意,这与矩阵指数不同:

B = exp(A)

可以使用 GSL (linalg.h) 中的一些不稳定/不受支持的代码来完成。

我刚刚找到了蛮力解决方案(几个“for”循环),但是有没有更聪明的方法来做到这一点?

编辑

德鲁霍尔解决方案帖子的结果

所有结果都来自1024x1024 for(for) 循环,其中每次迭代都会分配两个 double 值(复数)。 时间是100次执行的平均时间

  • 考虑到 {Row,Column}-Major 模式来存储矩阵时的结果:
    • 在 Row-Major 模式下循环内循环中的行时需要 226.56 毫秒(情况 1)。
    • 在 Row-Major 模式下循环遍历内循环中的列时为 223.22 毫秒(情况 2)。
    • 使用 GSL 提供的 gsl_matrix_complex_set 函数时为 224.60 毫秒(情况 3)。

案例 1 的源代码

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}

案例 2 的源代码

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}

案例 3 的源代码

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:

B[i][j] = exp(A[i][j])

where i in [0, Ny] and j in [0, Nx].

Notice that this is different from matrix exponential:

B = exp(A)

which can be accomplished with some unstable/unsupported code in GSL (linalg.h).

I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?

EDIT

Results from the solution post of Drew Hall

All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.

  • Results when taking into account the {Row,Column}-Major mode to store the matrix:
    • 226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
    • 223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
    • 224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).

Source code for case 1:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 2:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 3:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

风尘浪孓 2024-09-18 08:33:11

没有办法避免迭代所有元素并对每个元素调用 exp() 或等效方法。但迭代的方式有更快和更慢的。

特别是,您的目标应该是最大限度地减少缓存未命中。查明您的数据是否按行优先顺序或列优先顺序存储,并确保安排循环,以便内部循环迭代内存中连续存储的元素,而外部循环则大步前进到下一行(如果行主要)或列(如果列主要)。虽然这看起来微不足道,但它可以在性能上产生巨大的差异(取决于矩阵的大小)。

处理完缓存后,您的下一个目标是消除循环开销。第一步(如果您的矩阵 API 支持)是从嵌套循环(M 和 N 范围)转变为迭代基础数据(MN 范围)的单个循环。为此,您需要获取指向底层内存块的原始指针(即 double 而不是 double**)。

最后,加入一些循环展开(即,为循环的每次迭代执行 8 或 16 个元素)以进一步减少循环开销,这可能是您能做到的最快速度。您可能需要一个带有fall-through的最终switch语句来清理剩余元素(当您的数组大小%块大小!= 0时)。

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.

In particular, your goal should be to mimimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).

Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (MN bound). You'll need to get a raw pointer to the underlying memory block (that is, a double rather than a double**) to do this.

Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).

念﹏祤嫣 2024-09-18 08:33:11

不,除非有一些我没听说过的奇怪的数学怪癖,否则您几乎只需要使用两个 for 循环来遍历元素即可。

No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.

夢归不見 2024-09-18 08:33:11

如果您只想将 exp 应用于数字数组,那么确实没有捷径。你必须调用它 (Nx * Ny) 次。如果某些矩阵元素很简单,例如 0,或者有重复的元素,一些记忆可能会有所帮助。

但是,如果您真正想要的是矩阵指数(这非常有用),我们依赖的算法是 DGPADM。它是 Fortran 语言,但您可以使用 f2c 将其转换为 C。< a href="http://www.maths.uq.edu.au/expokit/paper.pdf" rel="nofollow noreferrer">这是关于它的论文。

If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.

However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

初见你 2024-09-18 08:33:11

由于循环的内容尚未显示,因此计算 c_value 的位我们不知道代码的性能是否受内存带宽限制或受 CPU 限制。唯一确定的方法是使用分析器,而且是一个复杂的分析器。它需要能够测量内存延迟,即 CPU 空闲等待数据从 RAM 到达的时间量。

如果您受到内存带宽的限制,那么一旦顺序访问内存,您就无能为力了。当数据按顺序获取时,CPU 和内存工作得最好。随机访问会影响吞吐量,因为数据更有可能从 RAM 提取到缓存中。您始终可以尝试获得更快的 RAM。

如果您受到 CPU 的限制,那么还有更多选项可供您选择。使用 SIMD 是一种选择,手动编码浮点代码也是一种选择(由于多种原因,C/C++ 编译器不擅长 FPU 代码)。如果是我,并且内部循环中的代码允许这样做,我将有两个指向数组的指针,一个位于数组的开头,另一个位于数组的 4/5 处。每次迭代,将使用第一个指针执行 SIMD 操作,并使用第二个指针执行标量 FPU 操作,以便循环的每次迭代执行五个值。然后,我将 SIMD 指令与 FPU 指令交错以减少延迟成本。这不会影响您的缓存,因为(至少在 Pentium 上)MMU 可以同时传输最多四个数据流(即在没有任何提示或特殊指令的情况下为您预取数据)。

Since the contents of the loop haven't been shown, the bit that calculates the c_value we don't know if the performance of the code is limited by memory bandwidth or limited by CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.

If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.

If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compiler aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文