Hello fellow programmers. I have already asked one question, but despite the really good answers I got, I couldn't fix my problem.
Then I took the time to refactor my code in a way that should improve its parallelization potential (by having fewer calculation batches, each with more work). But I still can't get better performance than serial processing.
I suspect this slow parallel processing is due to context switching. Or maybe it's due to "automatic" synchronization of shared objects. I think you can help me understand what's going on.
Let me state my case: I'm writing a program for scientific calculations. It does not depend on anything external, only on the input values I give it at startup.
The size of the problem can be measured by Ns (which is the name I use). It can be seen as the "resolution" of the solution; it is one of the user inputs and is usually of the order of 100.
Accordingly, I have several double arrays in my main class, such as double yS[Ns][N] or phiS[Ns][Nord][N], where N and Nord are other fixed sizes in the program.
In my program, I have to calculate several things for each one of the Ns points, and this is where the parallelization comes in. Each point's calculation is independent, so I can divide them among different threads and hope it gets faster.
So, instead of having a loop for (int i=0; i<Ns; i++), I divided this calculation duty into Runnable batches, each one covering a smaller interval: for (int i=start; i<end; i++), where start and end are always between 0 and Ns. For example, on a dual-core PC I make two batches, one with start = 0 and end = Ns/2, the other with start = Ns/2 and end = Ns. On a quad core, the second batch will have start = Ns/4 and end = Ns/2, and so on (assuming the division is exact in every case).
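The exact-division assumption above can be dropped by spreading the remainder over the first few batches. A minimal sketch, assuming half-open ranges [start, end); the class and method names (BatchRanges, split) are hypothetical and not taken from the original program:

```java
// Hypothetical helper: splits [0, ns) into `parts` contiguous half-open ranges.
final class BatchRanges {

    /** Returns `parts` rows of {start, end}, where end is exclusive. */
    static int[][] split(int ns, int parts) {
        if (ns < 0 || parts <= 0) {
            throw new IllegalArgumentException("ns must be >= 0 and parts > 0");
        }
        int[][] ranges = new int[parts][2];
        int base = ns / parts;   // minimum number of points per batch
        int extra = ns % parts;  // the first `extra` batches take one more point
        int start = 0;
        for (int p = 0; p < parts; p++) {
            int end = start + base + (p < extra ? 1 : 0);
            ranges[p][0] = start;
            ranges[p][1] = end;
            start = end;
        }
        return ranges;
    }
}
```

With Ns = 100 and two batches this gives the same [0, 50) and [50, 100) split described above; with Ns = 10 and four batches the sizes are 3, 3, 2 and 2.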
Each Batch, a class that implements Runnable, is stored in an ArrayList<Batch> and is given to a FixedThreadPool with size equal to the number of cores. The pool executes the batches, and the program waits for them to finish using a simple CountDown scheme.
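For readers who want to picture the scheme just described, here is a rough sketch. It is not the original code: the names BatchRunner, Batch and runAll are made up, the per-point work is a placeholder row sum, and the "CountDown scheme" is assumed to be a java.util.concurrent.CountDownLatch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: all names here are assumptions, not the original program.
final class BatchRunner {

    /** One batch: processes points start (inclusive) to end (exclusive). */
    static final class Batch implements Runnable {
        private final double[][] yS;
        private final double[] out;
        private final int start;
        private final int end;
        private final CountDownLatch done;

        Batch(double[][] yS, double[] out, int start, int end, CountDownLatch done) {
            this.yS = yS;
            this.out = out;
            this.start = start;
            this.end = end;
            this.done = done;
        }

        @Override
        public void run() {
            try {
                for (int i = start; i < end; i++) {
                    double sum = 0.0; // placeholder for the real per-point calculation
                    for (double v : yS[i]) {
                        sum += v;
                    }
                    out[i] = sum;
                }
            } finally {
                done.countDown(); // always signal completion, even on failure
            }
        }
    }

    /** Splits the rows of yS into nThreads batches and waits for all of them. */
    static double[] runAll(double[][] yS, int nThreads) {
        int ns = yS.length;
        double[] out = new double[ns];
        CountDownLatch done = new CountDownLatch(nThreads);
        List<Batch> batches = new ArrayList<>(nThreads);
        for (int t = 0; t < nThreads; t++) {
            int start = (int) ((long) ns * t / nThreads);
            int end = (int) ((long) ns * (t + 1) / nThreads);
            batches.add(new Batch(yS, out, start, end, done));
        }
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            for (Batch b : batches) {
                pool.execute(b);
            }
            done.await(); // countDown() happens-before await() returns
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for batches", e);
        } finally {
            pool.shutdown();
        }
        return out;
    }
}
```

The sketch creates and shuts down the pool inside runAll only to stay self-contained. When the batches run inside an outer loop, creating the pool once and reusing it is probably preferable, since thread start-up cost may be comparable to batches that take only a few milliseconds.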
Each of these batches needs to access the data in those arrays from the main class of the program, but each batch only reads from yS[start][] to yS[end][], and therefore two batches will never try to read the same array element. I wonder if Java still locks yS, even though each batch isn't trying to access the same elements as the others.
I also wonder if my problem is related to the overhead of context switching, as each batch needs to deal with thousands of doubles, and if the way the program is built can affect it.
Maybe I should find a way to pass to each batch just the elements of the arrays that are relevant to it, but I wouldn't know how to approach this. If there were pointers, I could have new arrays of just the desired elements with simple pointer operations and without reallocating anything. Is there a way to do such a thing in Java?
Finally, just to mention: there is one part of the code that needs to be synchronized (it deals with other arrays), and it is already working fine.
The calculation duties I've described above aren't the only thing my program does. They sit inside a loop, alternating with sequential processing parts, but they account for a significant share of the total execution time.
So, to summarize, the question is: why am I not gaining anything from multithreading when I expected to?
I've just run the plain serial program and the multithreaded program a couple of times and got 14500 ms for the serial version and 15651 ms for the multithreaded one. Both ran on the same dual-core machine.
Another point to note: in the serial run, each calculation duty (from 0 to Ns) takes around 1.1 to 4.5 ms.
With two threads, each batch (Ns/2 points) takes around 0.5 to 3 ms (measured from the top to the bottom of the run() method; the time of each calculation duty varies with its own numerical convergence).
Thanks very much for your attention.
Comments (4)
One possibility is that your threads are thrashing over cache lines. If different threads rapidly write to locations in the same cache line (for instance, to neighbouring elements of the same array), the hardware incurs a high communication overhead to keep the data consistent.
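A rough way to see what this means in code is sketched below. It is illustrative only: AccumulateDemo and run are made-up names, and whether the two variants actually differ in speed depends on the JIT compiler (which may keep the running total in a register) and on the hardware. The point is the access pattern: writing to a thread's slot on every iteration versus accumulating locally and storing once.

```java
// Illustrative sketch; AccumulateDemo and its method are hypothetical names.
final class AccumulateDemo {

    /**
     * Runs nThreads workers that each add 1.0 `iterations` times.
     * useLocal = false: every addition writes to the worker's slot in a shared array
     *                   (neighbouring slots typically share a cache line).
     * useLocal = true:  the worker sums in a local variable and stores once at the end.
     */
    static double[] run(int nThreads, int iterations, boolean useLocal) {
        final double[] totals = new double[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> {
                if (useLocal) {
                    double local = 0.0;
                    for (int k = 0; k < iterations; k++) {
                        local += 1.0;
                    }
                    totals[slot] = local;      // single write to shared memory
                } else {
                    for (int k = 0; k < iterations; k++) {
                        totals[slot] += 1.0;   // repeated writes next to other threads' slots
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                      // join makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return totals;
    }
}
```

Both variants compute the same totals. If the batches in the question write their results into adjacent positions of a shared array inside the inner loop, switching to the local-accumulator pattern might be worth trying.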
There is no automatic synchronization or locking in Java. You have to explicitly code that.
Context switches do have overhead. If all your threads work on the same CPU-intensive task, then the number of threads should equal the number of available processor cores.
Objects in Java are never copied implicitly: variables hold references to objects, and when you pass an object to a method, it is the reference that is passed (by value), not the object itself. A reference is basically a pointer, except that you cannot do pointer arithmetic on it. So no objects are copied in Java unless your code explicitly requests it.
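This also bears on the "pointer" idea in the question. A Java double[][] is an array of references to row arrays, so a sub-range of rows can be handed to a batch without copying any doubles. A small sketch, with RowViewDemo and rows as hypothetical names:

```java
import java.util.Arrays;

// Sketch: hands out a range of rows without copying the double values.
final class RowViewDemo {

    /**
     * Returns a new outer array holding references to rows start (inclusive)
     * to end (exclusive) of src. Only the row references are copied; the rows
     * themselves are shared with src.
     */
    static double[][] rows(double[][] src, int start, int end) {
        return Arrays.copyOfRange(src, start, end);
    }
}
```

Because the rows are shared, a write through the returned array is visible through the original one. That is the closest Java equivalent to the pointer-based slicing the question describes.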
That being said, you should be aware of another thing: if you add a lot of elements to collections (lists, HashMaps, etc.), those collections need to grow. Array-backed collections such as ArrayList and HashMap store their elements in internal arrays, and when enough elements are added the array has to be resized. Since an array cannot be resized in Java, a new array is created and all references to the existing objects are copied into it. If the array holds primitive types, all the data itself is copied. So, when creating such collections, give them an appropriate initial capacity so that they do not need to be resized.
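As a small illustration of the sizing advice, the sketch below passes the expected element count to the ArrayList constructor so that the backing array does not have to grow while it is being filled. PresizeDemo and fill are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pre-sizing an ArrayList when the number of elements is known up front.
final class PresizeDemo {

    static List<Integer> fill(int n) {
        // Initial capacity n: no internal array re-allocation while adding n elements.
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }
}
```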
You may also like to read How many threads should I use in my Java program?
Based on what you've mentioned so far, I would try the following things:
Compare results between the serial and the parallel version for increasing array sizes. The difference in performance may indeed be insignificant for your problem size and may only show up once the arrays get bigger.
Give each runnable its own copy of the array. The way the array is laid out in memory and how you access it can have an effect on performance. Even though you have a 2D array, in Java it is stored as an array of references to separate row arrays, each of which is laid out contiguously in memory. Hence, if you share this array between runnables, access may become inefficient for some of them.
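The first suggestion could be tried with a small harness like the one below. It is only a sketch: the per-point work (point) is a stand-in for the real calculation, all names are hypothetical, and a parallel IntStream is swapped in for the thread pool purely to keep the example short. Timings from a loop like this are rough, since JIT warm-up and other processes affect them.

```java
import java.util.stream.IntStream;

// Sketch: compares serial and parallel run time for growing problem sizes.
final class ScalingCheck {

    /** Placeholder per-point work; stands in for the real calculation. */
    static double point(int i, int n) {
        double acc = 0.0;
        for (int k = 1; k <= n; k++) {
            acc += Math.sqrt(i + k);
        }
        return acc;
    }

    static double[] serial(int ns, int n) {
        double[] out = new double[ns];
        for (int i = 0; i < ns; i++) {
            out[i] = point(i, n);
        }
        return out;
    }

    static double[] parallel(int ns, int n) {
        // Parallel stream used here instead of an explicit thread pool.
        return IntStream.range(0, ns).parallel().mapToDouble(i -> point(i, n)).toArray();
    }

    public static void main(String[] args) {
        for (int ns = 100; ns <= 100_000; ns *= 10) {
            long t0 = System.nanoTime();
            serial(ns, 1_000);
            long t1 = System.nanoTime();
            parallel(ns, 1_000);
            long t2 = System.nanoTime();
            System.out.printf("Ns=%d serial=%d us parallel=%d us%n",
                    ns, (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        }
    }
}
```

If the parallel version only starts to win at sizes well above the usual Ns of about 100, that would suggest the per-batch work is too small to amortize the threading overhead.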
Do you have enough memory available to create multiple collections and pass a unique collection of work to each thread? That way you can completely rule out contention from multiple threads accessing the same memory.
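If memory allows, giving each thread its own data could look roughly like this. The sketch deep-copies the rows a batch needs, so that the batch works on memory no other thread touches; BatchCopy and copyRows are hypothetical names.

```java
// Sketch: gives a batch a private deep copy of its rows.
final class BatchCopy {

    /** Deep-copies rows start (inclusive) to end (exclusive) of src. */
    static double[][] copyRows(double[][] src, int start, int end) {
        double[][] copy = new double[end - start][];
        for (int i = start; i < end; i++) {
            copy[i - start] = src[i].clone(); // new row array holding the same values
        }
        return copy;
    }
}
```

Unlike sharing row references, the copy costs extra memory and copying time up front, so it is only likely to pay off if each batch reuses its rows many times.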