基准测试(python 与 c++ 使用 BLAS)和(numpy)

发布于 2024-12-07 02:06:01 字数 2646 浏览 3 评论 0 原文

我想编写一个广泛使用 BLAS 和 LAPACK 线性代数功能的程序。由于性能是一个问题,我做了一些基准测试,想知道我采取的方法是否合法。

可以这么说,我有三个参赛者,想通过简单的矩阵乘法来测试他们的表现。参赛者是:

  1. Numpy,仅使用 dot 的功能。
  2. Python,通过共享对象调用 BLAS 功能。
  3. C++,通过共享对象调用 BLAS 功能。

场景

我实现了不同维度i的矩阵-矩阵乘法。 i 从 5 运行到 500,增量为 5,矩阵 m1m2 设置如下:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

使用的代码看起来像这样:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python,通过共享对象调用 BLAS

使用该函数,

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

测试代码看起来像这样:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. c++,通过共享对象调用 BLAS

现在,c++ 代码自然有点长,所以我将信息减少到最小值。
我加载该函数并

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

使用 gettimeofday 测量时间,如下所示:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

其中 j 是运行 20 次的循环。 计算经过的时间

double CalcTime(timeval start, timeval end)
{
double factor = 1000000;
return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

我用结果

结果如下图所示:

在此处输入图像描述

问题

  1. 你认为我的方法吗是公平的,还是有一些我可以避免的不必要的开销?
  2. 您是否期望结果会显示 c++ 和 python 方法之间存在如此巨大的差异?两者都使用共享对象进行计算。
  3. 由于我更愿意在我的程序中使用 python,那么在调用 BLAS 或 LAPACK 例程时我可以做些什么来提高性能?

下载

完整的基准测试可以在此处下载。 (JF Sebastian 使该链接成为可能^^)

I would like to write a program that makes extensive use of BLAS and LAPACK linear algebra functionalities. Since performance is an issue I did some benchmarking and would like know, if the approach I took is legitimate.

I have, so to speak, three contestants and want to test their performance with a simple matrix-matrix multiplication. The contestants are:

  1. Numpy, making use only of the functionality of dot.
  2. Python, calling the BLAS functionalities through a shared object.
  3. C++, calling the BLAS functionalities through a shared object.

Scenario

I implemented a matrix-matrix multiplication for different dimensions i. i runs from 5 to 500 with an increment of 5 and the matricies m1 and m2 are set up like this:

m1 = numpy.random.rand(i,i).astype(numpy.float32)
m2 = numpy.random.rand(i,i).astype(numpy.float32)

1. Numpy

The code used looks like this:

tNumpy = timeit.Timer("numpy.dot(m1, m2)", "import numpy; from __main__ import m1, m2")
rNumpy.append((i, tNumpy.repeat(20, 1)))

2. Python, calling BLAS through a shared object

With the function

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")
def Mul(m1, m2, i, r):

    no_trans = c_char("n")
    n = c_int(i)
    one = c_float(1.0)
    zero = c_float(0.0)

    _blaslib.sgemm_(byref(no_trans), byref(no_trans), byref(n), byref(n), byref(n), 
            byref(one), m1.ctypes.data_as(ctypes.c_void_p), byref(n), 
            m2.ctypes.data_as(ctypes.c_void_p), byref(n), byref(zero), 
            r.ctypes.data_as(ctypes.c_void_p), byref(n))

the test code looks like this:

r = numpy.zeros((i,i), numpy.float32)
tBlas = timeit.Timer("Mul(m1, m2, i, r)", "import numpy; from __main__ import i, m1, m2, r, Mul")
rBlas.append((i, tBlas.repeat(20, 1)))

3. c++, calling BLAS through a shared object

Now the c++ code naturally is a little longer so I reduce the information to a minimum.
I load the function with

void* handle = dlopen("libblas.so", RTLD_LAZY);
void* Func = dlsym(handle, "sgemm_");

I measure the time with gettimeofday like this:

gettimeofday(&start, NULL);
f(&no_trans, &no_trans, &dim, &dim, &dim, &one, A, &dim, B, &dim, &zero, Return, &dim);
gettimeofday(&end, NULL);
dTimes[j] = CalcTime(start, end);

where j is a loop running 20 times. I calculate the time passed with

double CalcTime(timeval start, timeval end)
{
double factor = 1000000;
return (((double)end.tv_sec) * factor + ((double)end.tv_usec) - (((double)start.tv_sec) * factor + ((double)start.tv_usec))) / factor;
}

Results

The result is shown in the plot below:

enter image description here

Questions

  1. Do you think my approach is fair, or are there some unnecessary overheads I can avoid?
  2. Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.
  3. Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Download

The complete benchmark can be downloaded here. (J.F. Sebastian made that link possible^^)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

-残月青衣踏尘吟 2024-12-14 02:06:01

更新(2014 年 7 月 30 日):

我在新的 HPC 上重新运行了基准测试。
硬件和软件堆栈都与原始答案中的设置发生了变化。

我将结果放入 Google 电子表格(还包含原始答案的结果)。

硬件

我们的 HPC 有两种不同的节点,一种采用 Intel Sandy Bridge CPU,另一种采用较新的 Ivy Bridge CPU:

Sandy(MKL、OpenBLAS、ATLAS):

  • CPU:2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz(16 核)
  • RAM:64 GB

Ivy(MKL、 OpenBLAS、ATLAS):

  • CPU:2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz(20 核,HT = 40 核)
  • RAM :256 GB

软件

两个节点的软件堆栈相同。使用OpenBLAS代替GotoBLAS2,并且还有一个设置为8个线程(硬编码)的多线程ATLAS BLAS。

  • 操作系统:Suse
  • 英特尔编译器:ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • < Strong>ATLAS:: 3.8.4

点积基准

基准代码如下。不过,对于新机器,我还运行了矩阵大小 50008000 的基准测试。
下表包括原始答案的基准测试结果(重命名:MKL --> Nehalem MKL、Netlib Blas --> Nehalem Netlib BLAS 等)

矩阵乘法 (sizes=[1000,2000,3000,5000,8000])

单线程性能:
单线程性能

多线程性能(8 个线程):
多线程(8 线程)性能

线程与矩阵大小 (Ivy Bridge MKL)
矩阵大小与线程

基准套件

benchmark suite

单线程性能:
在此处输入图像描述

多线程(8 线程)性能:
在此处输入图像描述

结论

新的基准测试结果与原始答案中的结果类似。除了特征值测试之外,OpenBLASMKL 的性能处于同一水平。
特征值测试仅在单线程模式下的OpenBLAS上表现得相当好。
在多线程模式下性能较差。

“矩阵大小与线程图表”还显示,虽然 MKL 以及 OpenBLAS 通常可以很好地随内核/线程数量进行扩展,但它取决于矩阵的大小。对于小型矩阵,添加更多核心不会显着提高性能。

Sandy BridgeIvy Bridge 性能也提高了大约 30%,这可能是由于更高的时钟速率 (+ 0.8 Ghz) 和/或更好的架构。


原始答案(04.10.2011):

前段时间我必须优化一些使用 numpy 和 BLAS 用 python 编写的线性代数计算/算法,所以我对不同的 numpy/BLAS 配置进行了基准测试/测试。

具体来说,我测试了:

  • Numpy 与 ATLAS
  • Numpy 与 GotoBlas2 (1.13)
  • Numpy 与 MKL ( 11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

我确实运行了两个不同的基准测试:

  1. 不同大小矩阵的简单点积
  2. 基准套件可以在此处找到。

以下是我的结果:

机器

Linux(MKL、ATLAS、No-MKL、GotoBlas2):

  • 操作系统:Ubuntu Lucid 10.4 64 位。
  • CPU:2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz(8 核)
  • RAM:24 GB
  • Intel 编译器:11.1 /073
  • Scipy:0.8
  • Numpy:1.5

Mac Book Pro(加速框架):

  • 操作系统:Mac OS X Snow Leopard (10.6)
  • CPU:1 个 Intel Core 2 Duo 2.93 Ghz(2 核)
  • RAM:4 GB
  • >Scipy:0.7
  • Numpy:1.3

Mac Server(加速框架):

  • 操作系统:Mac OS X Snow Leopard 服务器 (10.6)
  • CPU:4 个 Intel(R) Xeon(R) E5520 @ 2.26 Ghz(8 核)
  • RAM:4 GB
  • Scipy >:0.8
  • Numpy:1.5.1

点积基准

代码

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)

结果

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Dot 产品基准 - 图表

基准套件

代码
有关基准测试套件的更多信息,请参阅此处< /a>.

结果

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Benchmark suite - Chart

安装

MKL 的安装包括安装完整的英特尔编译器套件非常简单。然而,由于一些错误/问题,使用 MKL 支持配置和编译 numpy 有点麻烦。

GotoBlas2 是一个小包,可以轻松编译为共享库。然而,由于 bug 你有在构建共享库后重新创建共享库,以便将其与 numpy 一起使用。
除此之外,由于某种原因,为多个目标平台构建它并不起作用。因此,我必须为每个我想要优化的 libgoto2.so 文件的平台创建一个 .so 文件。

如果您从 Ubuntu 的存储库安装 numpy,它将自动安装并配置 numpy 以使用 ATLAS。从源代码安装 ATLAS 可能需要一些时间,并且需要一些额外的步骤(fortran 等)。

如果您在具有 FinkMac Ports 的 Mac OS X 计算机上安装 numpy,它会将 numpy 配置为使用 ATLASApple 的 Accelerate框架。
您可以通过在 numpy.core._dotblas 文件上运行 ldd 或调用 numpy.show_config() 进行检查。

结论

MKL 表现最佳,紧随其后的是 GotoBlas2
特征值测试中,GotoBlas2 的表现出奇地比预期差。不知道为什么会出现这种情况。
Apple 的 Accelerate Framework 性能非常好,尤其是在单线程模式下(与其他 BLAS 实现相比)。

GotoBlas2MKL 都可以很好地适应线程数量。因此,如果您必须处理大矩阵,那么在多个线程上运行它会很有帮助。

无论如何,不​​要使用默认的 netlib blas 实现,因为它对于任何严肃的计算工作来说都太慢了。

在我们的集群上,我还安装了 AMD 的 ACML,性能与 MKLGotoBlas2 类似。我没有任何数字。

我个人建议使用GotoBlas2,因为它更容易安装而且是免费的。

如果你想用 C++/C 编写代码,还可以查看 Eigen3 ,它应该是在某些案例中优于MKL/GotoBlas2,并且也很容易使用。

UPDATE (30.07.2014):

I re-run the the benchmark on our new HPC.
Both the hardware as well as the software stack changed from the setup in the original answer.

I put the results in a google spreadsheet (contains also the results from the original answer).

Hardware

Our HPC has two different nodes one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:

Sandy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
  • RAM: 64 GB

Ivy (MKL, OpenBLAS, ATLAS):

  • CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
  • RAM: 256 GB

Software

The software stack is for both nodes the sam. Instead of GotoBLAS2, OpenBLAS is used and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).

  • OS: Suse
  • Intel Compiler: ictce-5.3.0
  • Numpy: 1.8.0
  • OpenBLAS: 0.2.6
  • ATLAS:: 3.8.4

Dot-Product Benchmark

Benchmark-code is the same as below. However for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib Blas --> Nehalem Netlib BLAS, etc)

Matrix multiplication (sizes=[1000,2000,3000,5000,8000])

Single threaded performance:
single threaded performance

Multi threaded performance (8 threads):
multi-threaded (8 threads) performance

Threads vs Matrix size (Ivy Bridge MKL):
Matrix-size vs threads

Benchmark Suite

benchmark suite

Single threaded performance:
enter image description here

Multi threaded (8 threads) performance:
enter image description here

Conclusion

The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of Eigenvalue test.
The Eigenvalue test performs only reasonably well on OpenBLAS in single threaded mode.
In multi-threaded mode the performance is worse.

The "Matrix size vs threads chart" also show that although MKL as well as OpenBLAS generally scale well with number of cores/threads,it depends on the size of the matrix. For small matrices adding more cores won't improve performance very much.

There is also approximately 30% performance increase from Sandy Bridge to Ivy Bridge which might be either due to higher clock rate (+ 0.8 Ghz) and/or better architecture.


Original Answer (04.10.2011):

Some time ago I had to optimize some linear algebra calculations/algorithms which were written in python using numpy and BLAS so I benchmarked/tested different numpy/BLAS configurations.

Specifically I tested:

  • Numpy with ATLAS
  • Numpy with GotoBlas2 (1.13)
  • Numpy with MKL (11.1/073)
  • Numpy with Accelerate Framework (Mac OS X)

I did run two different benchmarks:

  1. simple dot product of matrices with different sizes
  2. Benchmark suite which can be found here.

Here are my results:

Machines

Linux (MKL, ATLAS, No-MKL, GotoBlas2):

  • OS: Ubuntu Lucid 10.4 64 Bit.
  • CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
  • RAM: 24 GB
  • Intel Compiler: 11.1/073
  • Scipy: 0.8
  • Numpy: 1.5

Mac Book Pro (Accelerate Framework):

  • OS: Mac OS X Snow Leopard (10.6)
  • CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
  • RAM: 4 GB
  • Scipy: 0.7
  • Numpy: 1.3

Mac Server (Accelerate Framework):

  • OS: Mac OS X Snow Leopard Server (10.6)
  • CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
  • RAM: 4 GB
  • Scipy: 0.8
  • Numpy: 1.5.1

Dot product benchmark

Code:

import numpy as np
a = np.random.random_sample((size,size))
b = np.random.random_sample((size,size))
%timeit np.dot(a,b)

Results:

    System        |  size = 1000  | size = 2000 | size = 3000 |
netlib BLAS       |  1350 ms      |   10900 ms  |  39200 ms   |    
ATLAS (1 CPU)     |   314 ms      |    2560 ms  |   8700 ms   |     
MKL (1 CPUs)      |   268 ms      |    2110 ms  |   7120 ms   |
MKL (2 CPUs)      |    -          |       -     |   3660 ms   |
MKL (8 CPUs)      |    39 ms      |     319 ms  |   1000 ms   |
GotoBlas2 (1 CPU) |   266 ms      |    2100 ms  |   7280 ms   |
GotoBlas2 (2 CPUs)|   139 ms      |    1009 ms  |   3690 ms   |
GotoBlas2 (8 CPUs)|    54 ms      |     389 ms  |   1250 ms   |
Mac OS X (1 CPU)  |   143 ms      |    1060 ms  |   3605 ms   |
Mac Server (1 CPU)|    92 ms      |     714 ms  |   2130 ms   |

Dot product benchmark - chart

Benchmark Suite

Code:
For additional information about the benchmark suite see here.

Results:

    System        | eigenvalues   |    svd   |   det  |   inv   |   dot   |
netlib BLAS       |  1688 ms      | 13102 ms | 438 ms | 2155 ms | 3522 ms |
ATLAS (1 CPU)     |   1210 ms     |  5897 ms | 170 ms |  560 ms |  893 ms |
MKL (1 CPUs)      |   691 ms      |  4475 ms | 141 ms |  450 ms |  736 ms |
MKL (2 CPUs)      |   552 ms      |  2718 ms |  96 ms |  267 ms |  423 ms |
MKL (8 CPUs)      |   525 ms      |  1679 ms |  60 ms |  137 ms |  197 ms |  
GotoBlas2 (1 CPU) |  2124 ms      |  4636 ms | 147 ms |  456 ms |  743 ms |
GotoBlas2 (2 CPUs)|  1560 ms      |  3278 ms | 116 ms |  295 ms |  460 ms |
GotoBlas2 (8 CPUs)|   741 ms      |  2914 ms |  82 ms |  262 ms |  192 ms |
Mac OS X (1 CPU)  |   948 ms      |  4339 ms | 151 ms |  318 ms |  566 ms |
Mac Server (1 CPU)|  1033 ms      |  3645 ms |  99 ms |  232 ms |  342 ms |

Benchmark suite - chart

Installation

Installation of MKL included installing the complete Intel Compiler Suite which is pretty straight forward. However because of some bugs/issues configuring and compiling numpy with MKL support was a bit of a hassle.

GotoBlas2 is a small package which can be easily compiled as a shared library. However because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition to this building it for multiple target plattform didn't work for some reason. So I had to create an .so file for each platform for which i want to have an optimized libgoto2.so file.

If you install numpy from Ubuntu's repository it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (fortran, etc).

If you install numpy on a Mac OS X machine with Fink or Mac Ports it will either configure numpy to use ATLAS or Apple's Accelerate Framework.
You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().

Conclusions

MKL performs best closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly worse than expected. Not sure why this is the case.
Apple's Accelerate Framework performs really good especially in single threaded mode (compared to the other BLAS implementations).

Both GotoBlas2 and MKL scale very well with number of threads. So if you have to deal with big matrices running it on multiple threads will help a lot.

In any case don't use the default netlib blas implementation because it is way too slow for any serious computational work.

On our cluster I also installed AMD's ACML and performance was similar to MKL and GotoBlas2. I don't have any numbers tough.

I personally would recommend to use GotoBlas2 because it's easier to install and it's free.

If you want to code in C++/C also check out Eigen3 which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.

浴红衣 2024-12-14 02:06:01

我已经运行了您的基准测试。在我的机器上 C++ 和 numpy 没有区别:

woltan's benchmark

您认为我的方法公平吗?或者我可以避免一些不必要的开销吗?

由于结果没有差异,这看起来很公平。

您是否期望结果会显示 C++ 和 Python 方法之间存在如此巨大的差异?两者都使用共享对象进行计算。

不。

由于我更愿意在程序中使用 python,那么在调用 BLAS 或 LAPACK 例程时我可以采取什么措施来提高性能?

确保 numpy 在您的系统上使用 BLAS/LAPACK 库的优化版本。

I've run your benchmark. There is no difference between C++ and numpy on my machine:

woltan's benchmark

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair due to there is no difference in results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses optimized version of BLAS/LAPACK libraries on your system.

維他命╮ 2024-12-14 02:06:01

这是另一个基准测试(在 Linux 上,只需输入 make):http: //dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

我愿意对于大型矩阵,Numpy、Ctypes 和 Fortran 之间的不同方法之间没有本质上的区别。 (Fortran 而不是 C++ --- 如果这很重要,那么您的基准测试可能已被破坏。)


C++ 中的 CalcTime 函数似乎存在符号错误。 ... + ((double)start.tv_usec)) 应该改为 ... - ((double)start.tv_usec))
也许你的基准测试还存在其他错误,例如,在不同的 BLAS 库之间进行比较,或在不同的 BLAS 设置(例如线程数)之间进行比较,或者在实时时间和 CPU 时间之间进行比较?

编辑:未能计算 CalcTime 函数中的大括号 - 没关系。

作为指导:如果您进行基准测试,请始终将所有代码发布到某处。在没有完整代码的情况下对基准进行评论,尤其是在令人惊讶的情况下,通常效率不高。


要找出 Numpy 链接到哪个 BLAS,请执行以下操作:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

更新:如果您无法导入 numpy.core._dotblas,则您的 Numpy 正在使用 BLAS 的内部后备副本,速度较慢,不适合用于性能计算!
下面@Woltan 的回复表明,这是他/她在 Numpy 与 Ctypes+BLAS 中看到的差异的解释。

要解决此问题,您需要 ATLAS 或 MKL --- 检查以下说明: http://scipy.org/Installing_SciPy /Linux 大多数 Linux 发行版都附带 ATLAS,因此最好的选择是安装其 libatlas-dev 软件包(名称可能有所不同)。

Here's another benchmark (on Linux, just type make): http://dl.dropbox.com/u/5453551/blas_call_benchmark.zip

http://dl.dropbox.com/u/5453551/blas_call_benchmark.png

I do not see essentially any difference between the different methods for large matrices, between Numpy, Ctypes and Fortran. (Fortran instead of C++ --- and if this matters, your benchmark is probably broken.)


Your CalcTime function in C++ seems to have a sign error. ... + ((double)start.tv_usec)) should be instead ... - ((double)start.tv_usec)).
Perhaps your benchmark also has other bugs, e.g., comparing between different BLAS libraries, or different BLAS settings such as number of threads, or between real time and CPU time?

EDIT: failed to count the braces in the CalcTime function -- it's OK.

As a guideline: if you do a benchmark, please always post all the code somewhere. Commenting on benchmarks, especially when surprising, without having the full code is usually not productive.


To find out which BLAS Numpy is linked against, do:

$ python
Python 2.7.2+ (default, Aug 16 2011, 07:24:41) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy.core._dotblas
>>> numpy.core._dotblas.__file__
'/usr/lib/pymodules/python2.7/numpy/core/_dotblas.so'
>>> 
$ ldd /usr/lib/pymodules/python2.7/numpy/core/_dotblas.so
    linux-vdso.so.1 =>  (0x00007fff5ebff000)
    libblas.so.3gf => /usr/lib/libblas.so.3gf (0x00007fbe618b3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe61514000)

UPDATE: If you can't import numpy.core._dotblas, your Numpy is using its internal fallback copy of BLAS, which is slower, and not meant to be used in performance computing!
The reply from @Woltan below indicates that this is the explanation for the difference he/she sees in Numpy vs. Ctypes+BLAS.

To fix the situation, you need either ATLAS or MKL --- check these instructions: http://scipy.org/Installing_SciPy/Linux Most Linux distributions ship with ATLAS, so the best option is to install their libatlas-dev package (name may vary).

面如桃花 2024-12-14 02:06:01

鉴于您的分析所表现出的严谨性,迄今为止的结果令我感到惊讶。我将其作为“答案”,但只是因为它对于评论来说太长并且确实提供了可能性(尽管我希望您已经考虑过它)。

我本以为 numpy/python 方法不会为具有合理复杂性的矩阵增加太多开销,因为随着复杂性的增加,python 参与的比例应该很小。我对图表右侧的结果更感兴趣,但显示的数量级差异会令人不安。

我想知道您是否正在使用 numpy 可以利用的最佳算法。来自 Linux 的编译指南:

“构建 FFTW (3.1.2):
SciPy 版本 >= 0.7 和 Numpy >= 1.2:
由于许可证、配置和维护问题,SciPy >= 0.7 和 NumPy >= 1.2 版本中删除了对 FFTW 的支持。现在改为使用 fftpack 的内置版本。
如果您的分析需要的话,有几种方法可以利用 FFTW 的速度。
降级到包含支持的 Numpy/Scipy 版本。
安装或创建您自己的 FFTW 包装器。请参阅 http://developer.berlios.de/projects/pyfftw/ 作为未经认可的例如。”

您是否使用 mkl 编译了 numpy?(http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974(尽管有url),关键部分是:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

如果你在Windows上,你可以获得一个使用 mkl 编译二进制文件(并且还获得 pyfftw 和许多其他相关算法)位于: http://www.lfd.uci.edu/~gohlke/pythonlibs/,其中感谢加州大学欧文分校荧光动力学实验室的 Christoph Gohlke。

警告,无论哪种情况,都有许多许可问题等需要注意,但英特尔页面对此进行了解释。再说一次,我想你已经考虑过这一点,但是如果你满足许可要求(这在 Linux 上很容易做到),相对于使用简单的自动构建,这将大大加快 numpy 部分的速度,甚至不需要 FFTW。我有兴趣关注此主题并看看其他人的想法。无论如何,非常严谨,问题也很好。感谢您发布它。

Given the rigor you've shown with your analysis, I'm surprised by the results thus far. I put this as an 'answer' but only because it's too long for a comment and does provide a possibility (though I expect you've considered it).

I would've thought the numpy/python approach wouldn't add much overhead for a matrix of reasonable complexity, since as the complexity increases, the proportion that python participates in should be small. I'm more interested in the results on the right hand side of the graph, but orders of magnitude discrepancy shown there would be disturbing.

I wonder if you're using the best algorithms that numpy can leverage. From the compilation guide for linux:

"Build FFTW (3.1.2):
SciPy Versions >= 0.7 and Numpy >= 1.2:
Because of license, configuration, and maintenance issues support for FFTW was removed in versions of SciPy >= 0.7 and NumPy >= 1.2. Instead now uses a built-in version of fftpack.
There are a couple ways to take advantage of the speed of FFTW if necessary for your analysis.
Downgrade to a Numpy/Scipy version that includes support.
Install or create your own wrapper of FFTW. See http://developer.berlios.de/projects/pyfftw/ as an un-endorsed example."

Did you compile numpy with mkl? (http://software.intel.com/en-us/articles/intel-mkl/). If you're running on linux, the instructions for compiling numpy with mkl are here: http://www.scipy.org/Installing_SciPy/Linux#head-7ce43956a69ec51c6f2cedd894a4715d5bfff974 (in spite of url). The key part is:

[mkl]
library_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64
include_dirs = /opt/intel/composer_xe_2011_sp1.6.233/mkl/include
mkl_libs = mkl_intel_lp64,mkl_intel_thread,mkl_core 

If you're on windows, you can obtain a compiled binary with mkl, (and also obtain pyfftw, and many other related algorithms) at: http://www.lfd.uci.edu/~gohlke/pythonlibs/, with a debt of gratitude to Christoph Gohlke at the Laboratory for Fluorescence Dynamics, UC Irvine.

Caveat, in either case, there are many licensing issues and so on to be aware of, but the intel page explains those. Again, I imagine you've considered this, but if you meet the licensing requirements (which on linux is very easy to do), this would speed up the numpy part a great deal relative to using a simple automatic build, without even FFTW. I'll be interested to follow this thread and see what others think. Regardless, excellent rigor and excellent question. Thanks for posting it.

站稳脚跟 2024-12-14 02:06:01

让我为一个有点奇怪的发现做出贡献
结果

我的 numpy 链接到 mkl,如 numpy.show_config() 所示。我不知道使用了什么样的 libblas.so C++/BLAS。我希望有人能告诉我一个方法来解决这个问题。

我认为结果很大程度上取决于所使用的库。我无法孤立 C++/BLAS 的效率。

Let me contribute to a somewhat weird finding
result

My numpy is linked to mkl, as given by numpy.show_config(). I have no idea what kind of libblas.so C++/BLAS was used. I hope someone can tell me a way to figure it out.

I think the results strongly depend on the library was used. I cannot isolate the efficiency of C++/BLAS.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文