多线程的内存访问

发布于 2024-09-09 19:15:19 字数 2453 浏览 7 评论 0原文

我正在编写一个在 Nehalem 处理器上运行的多线程 java 应用程序。然而,我有一个问题,从 4 个线程开始,我几乎看不到应用程序的加速。

我做了一些简单的测试。我创建了一个线程,它只分配一个大数组并访问数组中的随机条目。因此,当我运行线程数时,运行时间不应改变(假设我没有超过可用 CPU 核心的数量)。但我观察到,运行 1 或 2 个线程几乎需要相同的时间,但运行 4 或 8 个线程要慢得多。因此,在尝试解决应用程序中的算法和同步问题之前,我想找出可以实现的最大可能并行化。

我使用了 -XX:+UseNUMA JVM 选项,因此数组应该分配在相应线程附近的内存中。

PS 如果线程正在进行简单的数学计算,则 4 个甚至 8 个线程不会出现时间下降,因此我得出的结论是,当线程访问内存时,我遇到了一些问题。

任何帮助或想法表示赞赏,谢谢。


编辑

谢谢大家的回复。我发现我对自己的解释还不够好。

在尝试消除应用程序中的同步问题之前,我做了一个简单的测试,检查可以实现的最佳并行化。代码如下:

public class TestMultiThreadingArrayAccess {
    private final static int arrSize = 40000000;

    private class SimpleLoop extends Thread {
        public void run() {
            int array[] = new int[arrSize];
            for (long i = 0; i < arrSize * 10; i++) {
                array[(int) ((i * i) % arrSize)]++; // randomize a bit the access to the array
            }
            long sum = 0;
            for (int i = 0; i < arrSize; i++)
                sum += array[i];
        }
    }

    public static void main(String[] args) {
        TestMultiThreadingArrayAccess test = new TestMultiThreadingArrayAccess();
        for (int threadsNumber : new int[] { 1, 2, 4, 8 }) {
            Statistics timer = new Statistics("Executing " + threadsNumber+ " threads"); // Statistics is a simple helper class that measures the times
            timer.start();
            test.doTest(threadsNumber);
            timer.stop();
            System.out.println(timer.toString());
        }
    }

    public void doTest(int threadsNumber) {
        Thread threads[] = new Thread[threadsNumber];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new SimpleLoop();
            threads[i].start();
        }

        for (int i = 0; i < threads.length; i++)
            try {
                threads[i].join();
            } catch (InterruptedException e) {
            };
    }
}

因此,正如您所看到的,在这个迷你测试中根本没有同步,并且数组的分配是在线程内部,因此它应该放置在可以快速访问的内存块中。这段代码中也没有内存争用。对于 4 线程来说,运行时间仍然下降了 30%,而 8 线程的运行速度则慢了一倍。正如您从代码中看到的那样,我只是等到所有线程完成其工作,并且由于它们的工作是独立的,线程数量不应影响执行所需的总时间。

机器上安装了 2 个四核超线程 Nehalem 处理器(总共 16 个 CPU),因此每个线程都有 8 个线程,可以独占自己的 CPU。

当我尝试使用较小的数组(20K 条目)运行此测试时,4 个线程的执行时间下降了 7%,8 个线程的执行时间下降了 14%,这令人满意。但是,当我尝试对大型数组(40M 条目)进行随机访问时,运行时间会急剧增加,因此我认为存在以下问题:大块内存(因为它们不适合缓存内存?)以非-有效的方式。

有什么想法可以解决这个问题吗?

希望这能以更好的方式澄清这个问题,再次感谢。

I'm writing a multi threading java application that runs on Nehalem processor. However I have a problem that starting from 4 threads I almost don't see the speedup in my application.

I've made some simple test. I've created a thread that just allocates a big array and making access to random entries in the array. So when I run number of threads the running time shouldn't change (assuming I'm not exceeding number of available CPU cores). But what I observed is that running 1 or 2 threads takes almost the same time, but running 4 or 8 threads is significantly slower. So before trying so solve algorithmic and synchronization problem in my application I want to find out what is maximal possible parallelization I can achieve.

I've used -XX:+UseNUMA JVM option, so the arrays ought to be allocated in the memory near the corresponding threads.

P.S. If the threads were making a simple mathematical calculation there was no time drop for 4 and even 8 threads, so I concluded that when the threads are accessing memory I have some problems.

Any help or ideas are appreciated, thanks.


EDIT

Thanks you all for the replies. I see that I haven't explained myself good enough.

Before trying to eliminate synchronization problems in my application I made a simple test that check best possible parallelization that could be achieved. The code is as follows:

public class TestMultiThreadingArrayAccess {
    private final static int arrSize = 40000000;

    private class SimpleLoop extends Thread {
        public void run() {
            int array[] = new int[arrSize];
            for (long i = 0; i < arrSize * 10; i++) {
                array[(int) ((i * i) % arrSize)]++; // randomize a bit the access to the array
            }
            long sum = 0;
            for (int i = 0; i < arrSize; i++)
                sum += array[i];
        }
    }

    public static void main(String[] args) {
        TestMultiThreadingArrayAccess test = new TestMultiThreadingArrayAccess();
        for (int threadsNumber : new int[] { 1, 2, 4, 8 }) {
            Statistics timer = new Statistics("Executing " + threadsNumber+ " threads"); // Statistics is a simple helper class that measures the times
            timer.start();
            test.doTest(threadsNumber);
            timer.stop();
            System.out.println(timer.toString());
        }
    }

    public void doTest(int threadsNumber) {
        Thread threads[] = new Thread[threadsNumber];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new SimpleLoop();
            threads[i].start();
        }

        for (int i = 0; i < threads.length; i++)
            try {
                threads[i].join();
            } catch (InterruptedException e) {
            };
    }
}

So as you see there is no synchronization at all in this minitest and also the allocation of the array is inside the thread so it should be placed in the chunk of the memory that could be accessed quickly. Also there is no memory contentions in this code. Still for 4 threads there is a drop of 30% in the running time, and 8 threads runs twice slower. As you from the code I just wait until all threads finish theirs work, and since theirs work is independent number of threads shouldn't affect the total time the execution takes.

On the machine installed 2 quad-core hyperthreaded Nehalem processors (totally 16 CPUs), so with 8 threads each one could catch it's CPU exclusively.

When I tried to run this test with smaller array (20K entries) the drop of the execution time of 4 threads was 7% and 8 threads - 14%, which is satisfying. But when I try to operate random accessed on large array (40M entries) running times increase dramatically, so I think that there is problem that big chunks of memory (because they don't fit in the cache memory?) are accessed in a non-efficient way.

Are there any ideas how to fix this?

Hope this clarifies the question in a better way, thanks again.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

南烟 2024-09-16 19:15:20

测试中的瓶颈是CPU到内存的带宽。即使本地内存可用,它也会被一定数量的线程共享。 (内存是节点本地的,而不是特定核心的。)一旦 CPU 很容易超过像上面的测试这样的简单循环的可用带宽,因此在此类测试中增加线程不会提高性能,反而会恶化性能由于缓存一致性恶化。

只是一个理智测试,您是否也在使用并行收集器? -XX:+UseParallelGC。 UseNUMA 才生效。

The bottleneck in the test is the cpu to memory bandwith. Even when local memory is available, it is going to be shared by some number of threads. (The memory is local to a node, not to a specific core.) Once CPU can easily exceed the available bandwidth for a simple loop like your above test, and so increasing threads on such a test will not improve performance, and can worsen performance due to worsened cache coherence.

Just a sanity test, are you also using the parallel collector? -XX:+UseParallelGC. UseNUMA takes effect only then.

薔薇婲 2024-09-16 19:15:20

不知道你到底在做什么以及你试图解决的问题是什么。看起来您的代码周围有大量同步,因为这可能是可扩展性不够的主要原因。一旦过度同步使您的应用程序几乎串行,就会导致速度减慢。因此,我对您的建议是检查您的实施情况并尝试解决这个问题。

添加。

添加您正在执行的操作的实现后。性能下降可以通过大量内存访问来解释。一旦运行所有线程,它们需要访问内存控制器来获取未缓存的数据,因为它们在不同的 CPU 上运行,内存控制器会阻止 CPU 同时执行此操作,这意味着每次缓存未命中时都会在硬件级别进行同步。在你的情况下,它几乎等同于你正在运行 10 个不同的独立程序。例如,我想如果你启动 10 个(你可以用任何大数字替换 10 个)副本你的网络浏览器,你会看到相同的效果,但这并不意味着浏览器实现无效,你只是给浏览器带来了巨大的负担电脑内存。

Without knowing what exactly you are doing and what is the problem your trying to solve. It looks like you have heavy synchronization around your code, since it could be the main reason for not to be scalable enough. Over synchronization cause to slow down any speedup, once it make your application almost serial. So my suggestion to you is to inspect your implementation and trying to figure this out.

ADD.

After you've added your implementation of what you are doing. The downgrade of performance could be explained by large and massive memory access. Once you running all you thread and they need to access memory controller for not cached data, since they running on different CPU's, memory controller prevents CPU's from doing it simultaneously, meaning there is a synchronization at hardware level on each cache miss. In you case it's almost equal as if you were running 10 different independent programs. I guess if you will launch 10 (you can replace 10 by any large number) copies your web browser, for instance, you will see same effect, but this doesn't mean that browser implementation is ineffective, you just create a huge burden on computer memory.

一向肩并 2024-09-16 19:15:20

正如 Artem 所说,您可能进行了不必要的同步。但我首先要确定事实。您的应用程序真的像您所描述的那样运行速度较慢吗?

这是关于该主题的一篇富有洞察力的文章: http://codeidol.com/java/java-concurrency/Testing-Concurrent-Programs/Avoiding-Performance-Testing-Pitfalls/

编写有用的微基准测试实际上非常困难,尤其是在处理并发代码时。例如,您可以进行“死代码消除”,其中编译器会优化您认为正在执行的代码。也很难猜测垃圾收集何时运行。 Hotspot 的运行时优化也使测量变得更加困难。对于线程,您需要考虑用于创建它们的时间。因此,您可能需要使用“CyclicBarrier”等来进行准确的测量。诸如此类的事情......

话虽如此,我发现如果你所做的只是阅读,那么你在访问内存时会遇到问题。如果您可以发布代码,我们可能可以更好地帮助您...

As Artem notes, it is possible that you have unnecessary synchronization. But I'd start by establishing the facts. Is your app REALLY running slower as you describe?

Here is an insightful article on the subject: http://codeidol.com/java/java-concurrency/Testing-Concurrent-Programs/Avoiding-Performance-Testing-Pitfalls/

It's actually quite tough to write useful micro benchmarks, especially when you are dealing with concurrent code. For example, you can have "Dead code elimination" in which the compiler optimizes the code away you think is being executed. It's also difficult to guess when garbage collection runs. Hotspot's runtime optimization makes also measurement more difficult. In case of threads, you need to take in account the time that is used to create them. So you might need to use a `CyclicBarrier` etc. to have accurate measurement. Things like that..

Having said that, I find it hard that you'll have issues in accessing memory if all you are doing is reading. We might be able to help you better if you can post the code...

梦在夏天 2024-09-16 19:15:20

我想到了两个明显的潜在问题。

  • 使用更多的线程会分配更多的数组,从而破坏缓存。对主内存或较低级别缓存的访问要慢得多。
  • 如果您使用相同的随机数生成器实例源,则线程将争夺对其的访问权。它可能不是完全同步,而是使用无锁算法的内存屏障。一般来说,无锁算法虽然通常很快,但在高竞争情况下会变慢。

There are two obvious potential problems that spring to mind.

  • Using more thread allocates more arrays which burst the cache. Accesses to main memory or lower levels of cache are much slower.
  • If you are using the same source of instance of random number generator, then threads will be fighting over access to it. It might not be full synchronisation, but instead memory barriers with a lock-free algorithm. Generally lock-free algorithms, although generally fast, get much slower under high contention.
风尘浪孓 2024-09-16 19:15:20

除了并发问题之外,导致速度变慢的最可能原因是内存缓存争用。

如果所有线程都访问同一块存储,那么当您想要访问它时,它很可能位于其他处理器的内存缓存中。

如果存储是“只读”的,您可以为每个线程提供自己的副本,这将允许 JVM 和其他线程使用它。处理器优化内存访问。

Aside from concurrency problems the most likely cause of your slowup is memory cache contention.

If all threads are accessing the same piece of storage the chances are its in the other processers memory cache when you want to access it.

If the storage is "read only" you could give each thread its own copy which would allow the JVM & processor to optimise the memory acccesses.

一袭水袖舞倾城 2024-09-16 19:15:20

我根据我发布的文章中的建议修改了您的测试。在我的 2 核机器上(这就是我现在所拥有的)结果似乎是合理的(请注意,我为每个线程号运行了 2 次测试):

也许你可以尝试这个?
(请注意,我必须稍微修改您的测试(请参阅评论),因为在我较差的硬件上运行需要很长时间)

另请注意,我使用 -server 选项运行此测试。

Test with threadNum 1 took 2095717473 ns
Test with threadNum 1 took 2121744523 ns
Test with threadNum 2 took 2489853040 ns
Test with threadNum 2 took 2465152974 ns
Test with threadNum 4 took 5044335803 ns
Test with threadNum 4 took 5041235688 ns
Test with threadNum 8 took 10279012556 ns
Test with threadNum 8 took 10347970483 ns

代码:

import java.util.concurrent.*;

public class Test{
    private final static int arrSize = 20000000;

    public static void main(String[] args) throws Exception {
        int[] nums = {1,1,2,2,4,4,8,8};//allow hotspot optimization
        for (int threadNum : nums) {
            final CyclicBarrier gate = new CyclicBarrier(threadNum+1);
            final CountDownLatch latch = new CountDownLatch(threadNum);
            ExecutorService exec = Executors.newFixedThreadPool(threadNum);
            for(int i=0; i<threadNum; i++){
                Runnable test = 
                  new Runnable(){
                     public void run() {
                         try{
                             gate.await();
                         }catch(Exception e){
                             throw new RuntimeException(e);
                         }
                         int array[] = new int[arrSize];
                         //arrSize * 10 took very long to run so made it
                         // just arrSize.
                         for (long i = 0; i < arrSize; i++) {
                             array[(int) ((i * i) % arrSize)]++;
                         }//for
                         long sum = 0;
                         for (int i = 0; i < arrSize; i++){
                              sum += array[i]; 
                         }
                         if(new Object().hashCode()==sum){
                              System.out.println("oh");
                         }//if
                         latch.countDown();
                      }//run
                   };//test
                exec.execute(test);
             }//for
             gate.await();
             long start = System.nanoTime();
             latch.await();
             long finish = System.nanoTime();
             System.out.println("Test with threadNum " +
                 threadNum +" took " + (finish-start) + " ns ");
             exec.shutdown();
             exec.awaitTermination(Long.MAX_VALUE,TimeUnit.SECONDS);           
        }//for
    }//main

}//Test

I modified your test with the advice from the article I posted. On my 2 core machine (that's all I have right now) result seems reasonable (note that I ran 2 tests for each thread number):

Maybe you can try this?
(Please note that I had to modify your test slightly (see comment) because it took very long to run on my poor hardware)

Also note that I run this test using the -server option.

Test with threadNum 1 took 2095717473 ns
Test with threadNum 1 took 2121744523 ns
Test with threadNum 2 took 2489853040 ns
Test with threadNum 2 took 2465152974 ns
Test with threadNum 4 took 5044335803 ns
Test with threadNum 4 took 5041235688 ns
Test with threadNum 8 took 10279012556 ns
Test with threadNum 8 took 10347970483 ns

code:

import java.util.concurrent.*;

public class Test{
    private final static int arrSize = 20000000;

    public static void main(String[] args) throws Exception {
        int[] nums = {1,1,2,2,4,4,8,8};//allow hotspot optimization
        for (int threadNum : nums) {
            final CyclicBarrier gate = new CyclicBarrier(threadNum+1);
            final CountDownLatch latch = new CountDownLatch(threadNum);
            ExecutorService exec = Executors.newFixedThreadPool(threadNum);
            for(int i=0; i<threadNum; i++){
                Runnable test = 
                  new Runnable(){
                     public void run() {
                         try{
                             gate.await();
                         }catch(Exception e){
                             throw new RuntimeException(e);
                         }
                         int array[] = new int[arrSize];
                         //arrSize * 10 took very long to run so made it
                         // just arrSize.
                         for (long i = 0; i < arrSize; i++) {
                             array[(int) ((i * i) % arrSize)]++;
                         }//for
                         long sum = 0;
                         for (int i = 0; i < arrSize; i++){
                              sum += array[i]; 
                         }
                         if(new Object().hashCode()==sum){
                              System.out.println("oh");
                         }//if
                         latch.countDown();
                      }//run
                   };//test
                exec.execute(test);
             }//for
             gate.await();
             long start = System.nanoTime();
             latch.await();
             long finish = System.nanoTime();
             System.out.println("Test with threadNum " +
                 threadNum +" took " + (finish-start) + " ns ");
             exec.shutdown();
             exec.awaitTermination(Long.MAX_VALUE,TimeUnit.SECONDS);           
        }//for
    }//main

}//Test
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文