Parallel timing per thread with Numba in Python
When I run this program in parallel using Numba's njit, I notice that using many threads does not make a difference. In fact, from 1 to 5 threads the time gets faster (which is expected), but after that it gets slower. Why is this happening?
from numba import njit, prange, set_num_threads, get_num_threads
import numpy as np

@njit(parallel=True)
def test(x, y):
    z = np.empty((x.shape[0], x.shape[0]), dtype=np.float64)
    for i in prange(x.shape[0]):
        for j in range(x.shape[0]):
            z[i, j] = x[i, j] * y[i, j]
    return z

x = np.random.rand(10000, 10000)
y = np.random.rand(10000, 10000)

for i in range(16):
    set_num_threads(i + 1)
    print("Number of threads :", get_num_threads())
    %timeit -r 1 -n 10 test(x, y)
Number of threads : 1
234 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 2
178 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 3
168 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 4
161 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 5
148 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 6
152 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 7
152 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 8
153 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 9
154 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 10
156 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 11
158 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 12
157 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 13
158 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 14
160 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 15
160 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Number of threads : 16
161 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
I tested this in a Jupyter Notebook (Anaconda) on a CPU with 8 cores and 16 threads.
1 Answer
The code is memory-bound, so the RAM is saturated with only a few cores.
Indeed,
z[i, j] = x[i, j] * y[i, j]
causes two memory loads of 8 bytes, one store of 8 bytes, and an additional load of 8 bytes due to the write-allocate cache policy of x86-64 processors (a cache line that is written must first be read in this case). This means 32 bytes are loaded/stored per loop iteration while only one multiplication needs to be done. Modern mainstream (x86-64) processors can do 2x4 double-precision FP multiplications per cycle and operate at 3-5 GHz (in fact, Intel server processors can do 2x8 DP FP multiplications per cycle). Meanwhile, a good mainstream PC can only reach 40-60 GiB/s of memory bandwidth, and a high-performance server 200-350 GiB/s. There is no way to speed up memory-bound code like this in Numba. C/C++ code can improve on this a bit by avoiding write-allocates (up to 1.33 times faster). The best solution is to operate on smaller blocks where possible and to merge computing steps so that more FP operations are applied per step.
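For a rough sense of the numbers, the effective memory throughput implied by the measured times can be estimated from the 32 bytes of traffic per element. The snippet below is only an illustrative back-of-the-envelope check (it ignores allocation and page-fault overhead, so the machine's real sustained bandwidth is somewhat higher than the figures it prints):

# Each call to test() touches 10000*10000 elements with ~32 bytes of
# RAM traffic per element (2 loads + 1 store + 1 write-allocate read).
elements = 10_000 * 10_000
bytes_per_element = 32

# (threads, measured seconds) taken from the timings in the question
for threads, seconds in [(1, 0.234), (5, 0.148), (16, 0.161)]:
    gib_per_s = elements * bytes_per_element / seconds / 2**30
    print(f"{threads:2d} threads: ~{gib_per_s:.0f} GiB/s effective")

A single thread already reaches a large fraction of the sustainable bandwidth, so a handful of threads is enough to saturate the memory bus; beyond that, extra threads mostly add scheduling overhead, which matches the slight slowdown in the measurements above.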
In fact, the speed of RAM is known to increase slowly compared to the computing power of processors. This problem was identified a few decades ago, and the gap between the two keeps getting bigger over time. It is known as the "memory wall", and it is not going to get better in the future (at least that is very unlikely).
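To illustrate the "merge computing steps" suggestion, here is a minimal sketch. The follow-up operation (adding the two inputs back to the product) is purely a hypothetical placeholder for whatever the next processing step would be; the point is only that doing it in the same loop avoids writing a temporary array to RAM and reading it back, so more FP work is done per byte moved:

from numba import njit, prange
import numpy as np

@njit(parallel=True)
def multiply(x, y):
    # pass 1: the intermediate result is written out to RAM ...
    return x * y

@njit(parallel=True)
def next_step(z, x, y):
    # pass 2: ... and has to be read back again
    return z + x + y

@njit(parallel=True)
def fused(x, y):
    # both steps in one pass: each element is loaded once, combined in
    # registers and stored once, roughly halving the memory traffic
    out = np.empty_like(x)
    for i in prange(x.shape[0]):
        for j in range(x.shape[1]):
            xv = x[i, j]
            yv = y[i, j]
            out[i, j] = xv * yv + xv + yv
    return out

# two-pass version:  next_step(multiply(x, y), x, y)
# fused version:     fused(x, y)

Whether fusing pays off depends on what the surrounding code actually does with z; for a single standalone element-wise multiplication, the kernel in the question is already bandwidth-limited and cannot be made meaningfully faster with more threads.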