How to do fast complex arithmetic in C#
I'm working on a C# fractal generator project that requires a lot of arithmetic with complex numbers, and I'm trying to think of ways to speed up the math. Below is a simplified set of code that tests the speed of a Mandelbrot calculation using one of three data storage methods, shown in TestNumericsComplex, TestCustomComplex, and TestPairedDoubles. Please understand that the Mandelbrot is just an example, and I intend for future developers to be able to create plug-in fractal formulas.
Basically I see that using System.Numerics.Complex is an OK idea, while using a pair of doubles or a custom Complex struct are passable ideas. I can perform the arithmetic on the GPU, but wouldn't that limit or break portability? I've tried varying the order of the loops (i, x, y) to no avail. What else can I do to help speed up the inner loops? Am I running into page fault issues? Would using a fixed-point number system gain me any speed as opposed to floating-point values?
I'm already aware of Parallel.For in C# 4.0; it is omitted from my code samples for clarity. I'm also aware that C# is not usually a good language for high-performance work; I'm using C# to take advantage of Reflection for plugins and WPF for windowing.
using System;
using System.Diagnostics;

namespace SpeedTest {
    class Program {
        private const int ITER = 512;
        private const int XL = 1280, YL = 1024;

        static void Main(string[] args) {
            var timer = new Stopwatch();
            timer.Start();
            //TODO use one of these three lines
            //TestCustomComplex();
            //TestNumericsComplex();
            //TestPairedDoubles();
            timer.Stop();
            Console.WriteLine(timer.ElapsedMilliseconds);
            Console.ReadKey();
        }

        /// <summary>
        /// ~14000 ms on my machine
        /// </summary>
        static void TestNumericsComplex() {
            var vals = new System.Numerics.Complex[XL, YL];
            var loc = new System.Numerics.Complex[XL, YL];
            for (int x = 0; x < XL; x++)
                for (int y = 0; y < YL; y++) {
                    loc[x, y] = new System.Numerics.Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
                    vals[x, y] = new System.Numerics.Complex(0, 0);
                }
            for (int i = 0; i < ITER; i++) {
                for (int x = 0; x < XL; x++)
                    for (int y = 0; y < YL; y++) {
                        if (vals[x, y].Real > 4) continue;
                        vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
                    }
            }
        }

        /// <summary>
        /// ~17000 ms on my machine
        /// </summary>
        static void TestPairedDoubles() {
            var vals = new double[XL, YL, 2];
            var loc = new double[XL, YL, 2];
            for (int x = 0; x < XL; x++)
                for (int y = 0; y < YL; y++) {
                    loc[x, y, 0] = (x - XL / 2) / 256.0;
                    loc[x, y, 1] = (y - YL / 2) / 256.0;
                    vals[x, y, 0] = 0;
                    vals[x, y, 1] = 0;
                }
            for (int i = 0; i < ITER; i++) {
                for (int x = 0; x < XL; x++)
                    for (int y = 0; y < YL; y++) {
                        if (vals[x, y, 0] > 4) continue;
                        var a = vals[x, y, 0] * vals[x, y, 0] - vals[x, y, 1] * vals[x, y, 1];
                        var b = vals[x, y, 0] * vals[x, y, 1] * 2;
                        vals[x, y, 0] = a + loc[x, y, 0];
                        vals[x, y, 1] = b + loc[x, y, 1];
                    }
            }
        }

        /// <summary>
        /// ~16900 ms on my machine
        /// </summary>
        static void TestCustomComplex() {
            var vals = new Complex[XL, YL];
            var loc = new Complex[XL, YL];
            for (int x = 0; x < XL; x++)
                for (int y = 0; y < YL; y++) {
                    loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
                    vals[x, y] = new Complex(0, 0);
                }
            for (int i = 0; i < ITER; i++) {
                for (int x = 0; x < XL; x++)
                    for (int y = 0; y < YL; y++) {
                        if (vals[x, y].Real > 4) continue;
                        vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
                    }
            }
        }
    }

    public struct Complex {
        public double Real, Imaginary;

        public Complex(double a, double b) {
            Real = a;
            Imaginary = b;
        }

        public static Complex operator +(Complex a, Complex b) {
            return new Complex(a.Real + b.Real, a.Imaginary + b.Imaginary);
        }

        public static Complex operator *(Complex a, Complex b) {
            return new Complex(a.Real * b.Real - a.Imaginary * b.Imaginary, a.Real * b.Imaginary + a.Imaginary * b.Real);
        }
    }
}
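For reference, the Parallel.For variant left out of the samples could look roughly like the sketch below. It is an illustration only, not the code from the actual project: the class name ParallelMandelbrot and the method Run are made up for this example, and it assumes the loops are restructured so that each pixel is iterated to completion in a local variable, with rows of pixels handed out to the available cores. The escape check is the same `Real > 4` test used in the samples.

```csharp
using System;
using System.Numerics;
using System.Threading.Tasks;

// Illustrative sketch only: the class and method names are hypothetical.
static class ParallelMandelbrot {
    // Iterates z = z*z + c for every pixel and returns the final z values.
    // Each value of x is processed by whichever core Parallel.For assigns it to.
    public static Complex[,] Run(int xl, int yl, int iter) {
        var vals = new Complex[xl, yl];
        Parallel.For(0, xl, x => {
            for (int y = 0; y < yl; y++) {
                var c = new Complex((x - xl / 2) / 256.0, (y - yl / 2) / 256.0);
                var z = Complex.Zero;            // kept in a local, written back once
                for (int i = 0; i < iter; i++) {
                    if (z.Real > 4) break;       // same escape check as the samples
                    z = z * z + c;
                }
                vals[x, y] = z;                  // threads never write the same element
            }
        });
        return vals;
    }
}
```

Calling `ParallelMandelbrot.Run(1280, 1024, 512)` covers the same grid as the three tests. No timing is claimed here; the gain depends on the core count and would have to be measured.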
EDIT
The GPU seems to be the only feasible solution; I'm disregarding interoperability with C/C++ because I don't feel the speed-up would be significant enough to justify forcing interoperability on future plugins.
After looking into the available GPU options (which I've actually been examining for some time now), I've finally found what I believe is an excellent compromise. I've chosen OpenCL in the hope that most devices will support the standard by the time my program is released. OpenCLTemplate uses Cloo to provide an easy-to-understand interface between .NET (for application logic) and "OpenCL C99" (for parallel code). Plugins can include OpenCL kernels for hardware acceleration alongside the standard implementation with System.Numerics.Complex for ease of integration.
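To make the plugin idea concrete, here is a minimal sketch of what such a plugin contract might look like. Everything in it is an assumption for illustration: the names IFractalFormula, MandelbrotFormula, Iterate and OpenClSource are hypothetical and are not part of OpenCLTemplate or Cloo, and the OpenCL C fragment is only a helper function that a host-side kernel would still have to call. It uses float2 rather than double2 on the assumption that not every device supports double precision.

```csharp
using System.Numerics;

// Hypothetical plugin contract; not an OpenCLTemplate or Cloo type.
public interface IFractalFormula {
    // Standard CPU implementation: one iteration step z -> f(z, c).
    Complex Iterate(Complex z, Complex c);

    // Optional OpenCL C source for the same step; null would mean "CPU only".
    string OpenClSource { get; }
}

public sealed class MandelbrotFormula : IFractalFormula {
    public Complex Iterate(Complex z, Complex c) {
        return z * z + c;
    }

    // float2 holds (real, imaginary). This is a helper function, not a full kernel.
    public string OpenClSource {
        get {
            return
                "float2 iterate(float2 z, float2 c) {\n" +
                "    return (float2)(z.x*z.x - z.y*z.y + c.x, 2.0f*z.x*z.y + c.y);\n" +
                "}\n";
        }
    }
}
```

A host application could then fall back to `Iterate` whenever `OpenClSource` is null or no OpenCL device is available, so plugin authors are never forced to write GPU code.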
I expect the number of available tutorials on writing OpenCL C99 code to grow rapidly as the standard is adopted by processor vendors. This keeps me from needing to enforce GPU coding on plugin developers while providing them with a well-formulated language should they choose to take advantage of the option. It also means that IronPython scripts will have equal access to GPU acceleration despite being unknown until compile-time, since the code will translate directly through OpenCL.
For anyone in the future interested in integrating GPU acceleration with a .NET project, I highly recommend OpenCLTemplate. There is admittedly some overhead in learning OpenCL C99. However, it is only slightly harder than learning an alternative API and will likely have better support from examples and general communities.
Comments (3)
I think your best bet is to look at offloading these calculations to a graphics card. OpenCL can use graphics cards for this sort of thing, and so can OpenGL shaders.
To really take advantage of this, you want to be calculating in parallel. Let's say you want to take the square root of 1 million numbers (simple, I know, but the principle is the same). On a CPU you can only do one at a time, or you can work out how many cores you have (it's reasonable to expect, say, 8 cores) and have each core perform the calculation on a subset of the data.
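The CPU side of that comparison can be sketched in a few lines of C#. This is only an illustration of "one subset per core": the class name CpuSquareRoots is made up, and the chunking scheme is one simple choice among many.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative sketch: splits the input into one contiguous chunk per core.
static class CpuSquareRoots {
    public static double[] Compute(double[] input) {
        var output = new double[input.Length];
        int cores = Environment.ProcessorCount;
        int chunk = (input.Length + cores - 1) / cores;   // ceiling division

        Parallel.For(0, cores, core => {
            int start = core * chunk;
            int end = Math.Min(start + chunk, input.Length);
            // Each core only touches its own slice, so no locking is needed.
            for (int i = start; i < end; i++) {
                output[i] = Math.Sqrt(input[i]);
            }
        });
        return output;
    }
}
```

With 8 cores and 1 million inputs, each core would handle a slice of 125,000 numbers.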
If you offload your calculation to a graphics card, for example, you would 'feed' in your data as, say, a bunch of 1/4 million 3D points in space (that's four floats per vertex) and then have a vertex shader calculate the square root of each xyzw of each vertex. A graphics card has a hell of a lot more cores; even if it only had 100, it could still work on a lot more numbers at once than a CPU.
I can flesh this out with some more info if you want, though I am no expert on the use of shaders; I need to get up to scratch with them anyway.
EDIT
Looking at this relatively cheap card, an NVIDIA GT 220, you can see it has 48 'CUDA' cores. These are what you are using when you use things like OpenCL and shaders.
EDIT 2
OK, so it seems you're fairly interested in using GPU acceleration. I can't help you with OpenCL, as I've never looked into it, but I assume it works much the same as OpenGL/DirectX applications that make use of shaders, just without the actual graphics application. I'm going to talk about the DirectX way of things, as that is what I know (just about), but from my understanding it is more or less the same for OpenGL.
First, you need to create a window. As you want to be cross-platform, GLUT is probably the best way to go; it's not the best library in the world, but it gives you a window nice and fast. As you are not actually going to show any rendering, you could just make it a tiny window, just big enough to set the title to something like "HARDWARE ACCELERATING".
You get your graphics card set up and ready to render by following the tutorials from here. This will get you to the stage where you can create 3D models and 'animate' them on screen.
Next you want to create a vertex buffer that you populate with input data. A vertex would normally be three (or four) floats. If your values are all independent, that's cool. But if you need to group them together, say if you are in fact working with 2D vectors, then you need to make sure you 'pack' the data correctly. Say you want to do maths with 2D vectors and OpenGL is working with 3D vectors: then vector.x and vector.y are your actual input vector and vector.z would just be spare data.
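On the C# side, the packing described above could be sketched as follows. This is an assumption-laden illustration rather than anything tied to a particular graphics API: the class VertexPacking and its methods are made up, three floats per vertex is assumed, and the values are narrowed from double to float, which loses precision.

```csharp
using System.Numerics;

// Illustrative sketch: packs complex numbers into an x, y, z float layout.
static class VertexPacking {
    public const int FloatsPerVertex = 3;

    // x = real part, y = imaginary part, z = spare (unused) data.
    public static float[] Pack(Complex[] values) {
        var buffer = new float[values.Length * FloatsPerVertex];
        for (int i = 0; i < values.Length; i++) {
            int offset = i * FloatsPerVertex;
            buffer[offset + 0] = (float)values[i].Real;
            buffer[offset + 1] = (float)values[i].Imaginary;
            buffer[offset + 2] = 0f;
        }
        return buffer;
    }

    // Reads the results back, ignoring the spare z component.
    public static Complex[] Unpack(float[] buffer) {
        var values = new Complex[buffer.Length / FloatsPerVertex];
        for (int i = 0; i < values.Length; i++) {
            int offset = i * FloatsPerVertex;
            values[i] = new Complex(buffer[offset + 0], buffer[offset + 1]);
        }
        return values;
    }
}
```

The resulting float array is what would be copied into the vertex buffer; the same layout has to be assumed when reading the output buffer back.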
You see, the vertex shader can only work with one vertex at a time; it can't see more than one vertex as input. You could look into using a geometry shader, which can look at bigger sets of data.
So right, you set up a vertex buffer and pop that over to the graphics card. You also need to write a 'vertex shader', which is a text file in a sort of C-like language that lets you perform some maths. It is not a full C implementation, mind, but it looks enough like C for you to know what you're doing. The exact ins and outs of OpenGL shaders are beyond me, but I am sure a simple tutorial is easy enough to find.
One thing that you are on your own with is finding out exactly how to get the output of the vertex shader to go to a second buffer, which is effectively your output. A vertex shader does not change the vertex data in the buffer you set up, which is constant (as far as the shader is concerned), but you can get the shader to output to a second buffer.
Your calculation would look something like this
I hope this helps. Like I said, I am still learning this, and even then I am learning the DirectX way of things.
By making your custom struct mutable, I gained 30%. This reduces calls and memory usage.
For matrix multiplication you can also use 'unsafe { fixed (...) { } }' to access your arrays directly.
Using this I gained 15% for TestCustomComplex().
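One way to read the "mutable struct" suggestion is sketched below. It is a guess at the idea, not the exact code that was measured: the names MutableComplex, SquareAdd and MutableComplexDemo are made up for this example. The point is that calling a mutating method directly on an array element updates it in place, so no temporary Complex values are created by the operators. The unsafe/fixed variant is not shown because it needs the project to be compiled with unsafe code enabled.

```csharp
// Illustrative sketch: a mutable complex type updated in place.
public struct MutableComplex {
    public double Real, Imaginary;

    // Performs z = z*z + c without allocating or copying a new struct.
    public void SquareAdd(MutableComplex c) {
        double re = Real * Real - Imaginary * Imaginary + c.Real;
        Imaginary = 2 * Real * Imaginary + c.Imaginary;
        Real = re;
    }
}

static class MutableComplexDemo {
    // Same loop shape as TestCustomComplex, using the in-place update.
    public static MutableComplex[,] Run(int xl, int yl, int iter) {
        var vals = new MutableComplex[xl, yl];   // elements start at 0 + 0i
        var loc = new MutableComplex[xl, yl];
        for (int x = 0; x < xl; x++)
            for (int y = 0; y < yl; y++) {
                loc[x, y].Real = (x - xl / 2) / 256.0;
                loc[x, y].Imaginary = (y - yl / 2) / 256.0;
            }
        for (int i = 0; i < iter; i++)
            for (int x = 0; x < xl; x++)
                for (int y = 0; y < yl; y++) {
                    if (vals[x, y].Real > 4) continue;
                    vals[x, y].SquareAdd(loc[x, y]);   // mutates the array element itself
                }
        return vals;
    }
}
```

Note that this only works in place because array elements are variables; the same call on an element of a List would act on a copy.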
Personally, if this is a major issue, I would create a C++ DLL and use that to do the arithmetic. You can call this DLL from C#, so you can still take advantage of WPF, reflection, etc.
One thing to note is that calling into the DLL isn't exactly "fast", so you want to make sure you pass ALL your data in one go and don't call it very often.