Multithreaded C++ application: threads block each other when (de)allocating memory

Posted on 2025-01-14 23:14:56


World,

I am trying to run a C++ application (compiled in VS as an .exe) with multiple threads, using QThread or OpenMP parallelization. Each thread performs multiple allocations/deallocations of memory for large matrix computations before solving the equation systems built from these matrices with umfpack. Now, when I use too many threads, I lose performance because the threads block each other while doing this. I have read that memory (de)allocation is possible for only one thread at a time (like a mutex condition).

What I have tried already:

  • decrease large reallocations as best I could
  • use different parallelization methods (Qt vs. OpenMP)
  • randomly change the reserved and committed stack/heap size
  • make the umfpack arrays threadprivate

In my setup, I am able to use ~4 threads (each thread uses ~1.5 GB RAM) before performance decreases. Interestingly - but something I couldn't wrap my head around yet - the performance is reduced only after a couple of threads have finished and new ones are taking over. Note also that the threads do not depend on each other, there are no other blocking conditions, and each thread runs for roughly the same amount of time (~2 min).

Is there an "easy way" - e.g. setting up heap/stack in a certain way - to solve this issue?

Here are some code snippets:

// Loop to start threads

forever
{
    if (sem.tryAcquire(1)) {
        QThread *t = new QThread();
        connect(t, SIGNAL(started()), aktBer, SLOT(doWork()));
        connect(aktBer, SIGNAL(workFinished()), t, SLOT(quit()));
        connect(t, SIGNAL(finished()), t, SLOT(deleteLater()));
        aktBer->moveToThread(t);
        t->start();
        sleep(1);
    }
    else {
        //... wait for threads to end before starting new ones
        //... eventually break
    }
    qApp->processEvents();
}

void doWork() {
    // Do initial matrix stuff...

    // Initializing array pointers for the umfpack library
    // (Ax, x, b are double* as required by umfpack_di_solve)
    static int    *Ap = 0;
    static int    *Ai = 0;
    static double *Ax = 0;
    static double *x  = 0;
    static double *b  = 0;

    // Private static variables per thread
    #pragma omp threadprivate(Ap, Ai, Acol, Arow)

    // Solving -> this is the part where the threads block each other.
    // Note that there are other functions with matrix operations,
    // which also (de-)allocate a lot.
    status = umfpack_di_solve (UMFPACK_A, Ap,Ai,Ax,x,b, /*...*/);

    emit workFinished();
}


Comments (1)

十六岁半 2025-01-21 23:14:56


For those who are interested in my solution: I included another allocator in my app (as @Ben Voigt suggested). In my case, I chose mimalloc, as it seems to get regular maintenance (even by Microsoft itself) and can be included pretty easily.
See here: mimalloc
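As a sketch of how dropping in such an allocator can look (assuming mimalloc's documented CMake package; `myapp` is a placeholder target name):

```cmake
# Hypothetical CMakeLists.txt fragment: link the app against mimalloc
# so its malloc/free (and operator new/delete) replace the default
# allocator's versions.
find_package(mimalloc REQUIRED)
target_link_libraries(myapp PRIVATE mimalloc)
```

On Windows, mimalloc's docs describe an additional override mechanism (the redirection DLL) for replacing the CRT allocator in an existing .exe.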
