线程化时的数据损坏矢量统计库-数学核心库
我刚刚并行化了一个模拟个人行为的 Fortran 例程,并且在使用矢量统计库(数学内核库的一个库)生成随机数时遇到了一些问题。 该程序的结构如下:
program example
...
!$omp parallel do num_threads(proc) default(none) private(...) shared(...)
do i=1,n
call firstroutine(...)
enddo
!$omp end parallel do
...
end program example
subroutine firstroutine
...
call secondroutine(...)
...
end subroutine
subroutine secondroutine
...
VSL calls
...
end subroutine
我使用 Intel Fortran 编译器进行编译,其 makefile 如下所示:
f90comp = ifort
libdir = /home
mklpath = /opt/intel/mkl/10.0.5.025/lib/32/
mklinclude = /opt/intel/mkl/10.0.5.025/include/
exec: Example.o Firstroutine.o Secondroutine.o
$(f90comp) -O3 -fpscomp logicals -openmp -o aaa -L$(mklpath) -I$(mklinclude) Example.o -lmkl_ia32 -lguide -lpthread
Example.o: $(libdir)Example.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Example.f90
Firstroutine.o: $(libdir)Firstroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Firstroutine.f90
Secondroutine.o: $(libdir)Secondroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c -L$(mklpath) -I$(mklinclude) $(libdir)Secondroutine.f90 -lmkl_ia32 -lguide -lpthread
编译时一切正常。 当我运行用它生成变量的程序时,一切似乎都工作正常。 然而,有时(例如每 200-500 次迭代一次),它会在几次迭代中生成疯狂的数字,然后以正常方式再次运行。 我还没有找到这种腐败何时发生的任何模式。
知道为什么会发生这种情况吗?
I've just parallelized a fortran routine that simulates individuals behavior and I've had some problems when generating random numbers with Vector Statistical Library (a library from the Math Kernel Library). The structure of the program is the following:
program example
...
!$omp parallel do num_threads(proc) default(none) private(...) shared(...)
do i=1,n
call firstroutine(...)
enddo
!$omp end parallel do
...
end program example
subroutine firstroutine
...
call secondroutine(...)
...
end subroutine
subroutine secondroutine
...
VSL calls
...
end subroutine
I use the Intel Fortran Compiler for the compilation with a makefile that looks as follows:
f90comp = ifort
libdir = /home
mklpath = /opt/intel/mkl/10.0.5.025/lib/32/
mklinclude = /opt/intel/mkl/10.0.5.025/include/
exec: Example.o Firstroutine.o Secondroutine.o
$(f90comp) -O3 -fpscomp logicals -openmp -o aaa -L$(mklpath) -I$(mklinclude) Example.o -lmkl_ia32 -lguide -lpthread
Example.o: $(libdir)Example.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Example.f90
Firstroutine.o: $(libdir)Firstroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Firstroutine.f90
Secondroutine.o: $(libdir)Secondroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c -L$(mklpath) -I$(mklinclude) $(libdir)Secondroutine.f90 -lmkl_ia32 -lguide -lpthread
At compilation time everything works fine. When I run my program generating variables with it, everything seems to work fine. However, from time to time (say once each 200-500 iterations), it generates crazy numbers for a couple of iterations and then runs again in a normal way. I have not found any patern to when does this corruption happen.
Any idea on why is it happening?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
随机数代码要么在内部使用全局变量,要么所有线程使用相同的生成器。 最终,两个线程将尝试同时更新同一块内存,结果将是不可预测的。
因此,您必须为每个线程分配一个随机数生成器。
解决方案:使用信号量/锁保护对随机例程的调用。
The random number code is either using a global variable internally or all threads use the same generator. Eventually, two threads will try to update the same piece of memory at the same time and the result will be non-predictable.
So you must allocate one random number generator per thread.
Solution: Protect the call to the random routine with a semaphore/lock.
我得到了解决方案! 我正在修改从文件中获取的某些值生成的伪随机数。 有时,多个线程尝试读取同一文件并产生损坏。 为了解决这个问题,我添加了一个 omp 关键部分并且它起作用了。
I got the solution! I was modifying the pseudo-random numbers generated by some values taken from a file. From time to time, more than one thread tried to read the same file and generated the corruption. To solve this, I added a omp critical section and it worked.