Operating on a CUDA matrix while reading from a global point
Hey there,
I have a mathematical function (multidimensional, which means there is an index that I pass to the C++ function to select which single mathematical function I want returned). E.g. let's say I have a mathematical function like this:
f = Vector(x^2*y^2 / y^2 / x^2*z^2)
I would implement it like this:
double myFunc(int function_index)
{
    switch (function_index)
    {
    case 1:
        return PNT[0]*PNT[0]*PNT[1]*PNT[1];  // x^2 * y^2
    case 2:
        return PNT[1]*PNT[1];                // y^2
    case 3:
        return PNT[0]*PNT[0]*PNT[2]*PNT[2];  // x^2 * z^2
    default:
        return 0.0;  // avoid falling off the end for unknown indices
    }
}
where PNT is defined globally as double PNT[ NUM_COORDINATES ]. Now I want to implement the derivatives of each function for each coordinate, thus generating the derivative matrix (columns = coordinates; rows = single functions). I already wrote my kernel, which works so far and which calls myFunc().
The problem is: for calculating the derivative of the mathematical sub-function i with respect to coordinate j, I would use the following code in sequential mode (on CPUs, e.g.); this is simplified, because usually you would decrease h until you reach a certain precision of your derivative:
f0 = myFunc(i);                    // unperturbed value of sub-function i
PNT[ j ] += h;                     // shift coordinate j by h
derivative = (myFunc(i) - f0)/h;   // forward difference
PNT[ j ] -= h;                     // restore the original point
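To make the sequential scheme concrete, here is a minimal self-contained CPU sketch of building the whole derivative matrix this way. The sub-function f[i] = x[i]^3, the 4-coordinate size, and the name derivativeMatrix are assumptions for illustration, not code from the question:

```cpp
#include <vector>

const int NUM_COORDINATES = 4;
double PNT[NUM_COORDINATES];

// Illustrative sub-functions: f[i] = x[i]^3.
double myFunc(int function_index)
{
    double x = PNT[function_index];
    return x * x * x;
}

// Build the full derivative matrix J[i][j] = df_i/dx_j with forward differences.
std::vector<std::vector<double>> derivativeMatrix(double h)
{
    std::vector<std::vector<double>> J(NUM_COORDINATES,
                                       std::vector<double>(NUM_COORDINATES));
    for (int i = 0; i < NUM_COORDINATES; i++) {
        double f0 = myFunc(i);      // unperturbed value of sub-function i
        for (int j = 0; j < NUM_COORDINATES; j++) {
            PNT[j] += h;            // shift coordinate j ...
            J[i][j] = (myFunc(i) - f0) / h;
            PNT[j] -= h;            // ... and restore the original point
        }
    }
    return J;
}
```

The in-place perturb/restore of the global PNT is exactly what stops working once many threads run this loop body concurrently, which is the problem described next.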
Now that I want to do this on the GPU in parallel, the problem comes up: what to do with PNT? Since I have to increase a certain coordinate by h, calculate the value and then decrease it again, the question is: how to do this without 'disturbing' the other threads? I can't modify PNT, because the other threads need the 'original' point to modify their own coordinate.
The second idea I had was to save one modified point for each thread, but I discarded this idea quite fast: when using some thousand threads in parallel, this is quite bad and probably slow (and perhaps not realizable at all because of memory limits).
'FINAL' SOLUTION
So what I currently do is the following, which adds the value add at runtime (without storing it anywhere) to the coordinate identified by coordinate_index, via a preprocessor macro:
#define X(n) ((coordinate_index == n) ? (PNT[n] + add) : PNT[n])

__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    //*// Example: f[i] = x[i]^3
    return (X(function_index) * X(function_index) * X(function_index));
    // */
}
That works quite nicely and fast: with a derivative matrix of 10000 functions and 10000 coordinates it takes only about 0.5 seconds. PNT is defined either globally or in constant memory as __constant__ double PNT[ NUM_COORDINATES ];, depending on the preprocessor variable USE_CONST.
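The kernel itself is not shown in the question; a minimal sketch of how one thread per matrix entry might call this myFunc could look as follows. DERIV, NUM_FUNCTIONS and the launch geometry are assumptions for illustration, not the question's actual code:

```cuda
// Hypothetical kernel: one thread per derivative-matrix entry.
// i selects the sub-function (row), j the coordinate (column).
__global__ void derivativeKernel(double *DERIV, double h)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // coordinate index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // function index

    if (i < NUM_FUNCTIONS && j < NUM_COORDINATES)
    {
        double f0 = myFunc(i, j, 0.0);  // unperturbed value
        double f1 = myFunc(i, j, h);    // coordinate j shifted by h, PNT untouched
        DERIV[i * NUM_COORDINATES + j] = (f1 - f0) / h;
    }
}
```

Because the shift lives only in the thread-local parameter add, no thread ever writes to PNT, which is what makes constant memory usable here.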
The line return (X(function_index)*X(function_index)*X(function_index)); is just an example in which every sub-function follows the same scheme; mathematically spoken:
f = Vector(x0^3 / x1^3 / ... / xN^3)
NOW THE BIG PROBLEM ARISES:
myFunc is a mathematical function which the user should be able to implement however they like. E.g. they could also implement the following mathematical function:
f = Vector(x0^2*x1^2*...*xN^2 / x0^2*x1^2*...*xN^2 / ... / x0^2*x1^2*...*xN^2)
so every component looks the same. As a programmer you should code this only once, independently of the implemented mathematical function. So when the above function is implemented in C++, it looks like the following:
__device__ double myFunc(int function_index, int coordinate_index, double add)
{
    double ret = 1.0;
    for (int i = 0; i < NUM_COORDINATES; i++)
        ret *= X(i) * X(i);
    return ret;
}
And now the memory accesses are very 'weird' and bad for performance, because each thread needs to access every element of PNT twice. Surely, in such a case where every sub-function looks the same, I could rewrite the complete algorithm that surrounds the calls to myFunc, but as I stated already: I don't want my code to depend on the user-implemented function myFunc...
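For the specific product form above there is an algebraic way out, which may or may not generalize to other user functions: compute the unperturbed product once, and obtain each perturbed value by dividing out the old factor and multiplying in the shifted one, so each thread touches only one coordinate. A CPU sketch under that assumption (denseValue and densePerturbed are hypothetical names, and the perturbed coordinate must be nonzero):

```cpp
const int NUM_COORDINATES = 4;
double PNT[NUM_COORDINATES];

// Unperturbed value of f[i] = prod_k x[k]^2 (the same for every i).
double denseValue()
{
    double ret = 1.0;
    for (int k = 0; k < NUM_COORDINATES; k++)
        ret *= PNT[k] * PNT[k];
    return ret;
}

// Value with coordinate j shifted by h, reusing the precomputed product
// instead of re-reading all of PNT (assumes PNT[j] != 0).
double densePerturbed(double total, int j, double h)
{
    double xj = PNT[j];
    double shifted = xj + h;
    return total / (xj * xj) * (shifted * shifted);
}
```

This is exactly the kind of restructuring that only works because the function's form is known, which is the tension the question describes.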
Could anybody come up with an idea how to solve this problem?
Thanks!
Rewinding back to the beginning and starting with a clean sheet, it seems you want to be able to do two things:

- evaluate a scalar valued function over an input array
- approximate the first derivative of that function over the input array

using first order accurate finite differencing.
While the function is scalar valued and arbitrary, it seems that there are, in fact, two clear forms this function can take: one in which each sub-function depends on only a single coordinate, and one in which each sub-function depends on every coordinate.
You appear to have started with the first type of function, have put together code to deal with computing both the function and the approximate derivative, and are now wrestling with the problem of how to deal with the second case using the same code.
If this is a reasonable summary of the problem, then please indicate so in a comment and I will continue to expand it with some code samples and concepts. If it isn't, I will delete it in a few days.
In comments, I have been trying to suggest that conflating the first type of function with the second is not a good approach. The requirements for correctness in parallel execution, and the best way of extracting parallelism and performance on the GPU are very different. You would be better served by treating both types of functions separately in two different code frameworks with different usage models. When a given mathematical expression needs to be implemented, the "user" should make a basic classification as to whether that expression is like the model of the first type of function, or the second. The act of classification is what drives algorithmic selection in your code. This type of "classification by algorithm" is almost universal in well designed libraries - you can find it in C++ template libraries like Boost and the STL, and you can find it in legacy Fortran codes like the BLAS.
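As a purely illustrative CPU sketch of that classification idea (every name here is hypothetical, not code from the question or the answer): the user tags each function with its form, and the framework dispatches to a differently structured evaluation path:

```cpp
const int N = 4;
double PNT[N];

// Hypothetical classification: the user declares which form a function has.
enum FunctionClass { PER_COORDINATE, ALL_COORDINATES };

// First form: f[i] reads a single coordinate (here f[i] = x[i]^3).
double sparseFunc(int i, int j, double add)
{
    double x = (i == j) ? PNT[i] + add : PNT[i];
    return x * x * x;
}

// Second form: f[i] reads every coordinate (here f[i] = prod_k x[k]^2).
double denseFunc(int /*i*/, int j, double add)
{
    double ret = 1.0;
    for (int k = 0; k < N; k++) {
        double x = (k == j) ? PNT[k] + add : PNT[k];
        ret *= x * x;
    }
    return ret;
}

// The classification drives the algorithm choice; on the GPU each branch
// would correspond to a differently parallelized kernel.
double derivative(FunctionClass cls, int i, int j, double h)
{
    double f0 = (cls == PER_COORDINATE) ? sparseFunc(i, j, 0.0) : denseFunc(i, j, 0.0);
    double f1 = (cls == PER_COORDINATE) ? sparseFunc(i, j, h)   : denseFunc(i, j, h);
    return (f1 - f0) / h;
}
```

On the GPU, the PER_COORDINATE branch parallelizes trivially over matrix entries, while the ALL_COORDINATES branch would instead want cooperative loads (e.g. staging PNT through shared memory) and a reduction-style product, which is why a single code path serves both poorly.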