Overhead due to use of Events

Posted 2024-08-02 08:27:29

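I have a custom thread pool class that creates some threads, each of which waits on its own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.

The problem is the following: I have around 1000 loops of around 10'000 iterations each to do. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is split each 10'000-iteration loop into four 2'500-iteration loops, i.e. one per thread. But I have to wait for the 4 small loops to finish before going on to the next "big" iteration. This means that I can't bundle the jobs.

My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread).

I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().

It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive!

My question is: is it normal that using events takes "a lot of" time? If so, is there another mechanism that I could use that would be less time-expensive?

Here is some code to illustrate (some relevant parts copied from my thread pool class):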

// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
    // some housekeeping
    HANDLE signals[2];
    signals[0] = waitSignal;
    signals[1] = endSignal;

    do {
        // wait for one of the signals
        waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);

        // try to get the next job parameters;
        if (tp->getNextJob(threadId, data)) {
            // execute job
            void* output = jobFunc(data.params);

            // tell thread pool that we're done and collect output
            tp->collectOutput(data.ID, output);
        }

        tp->threadDone(threadId);
    }
    while (waitResult - WAIT_OBJECT_0 == 0);

    // if we reach this point, endSignal was sent, so we are done !

    return 0;
}

// create all threads
for (int i = 0; i < nbThreads; ++i) {
    threadData data;
    unsigned int threadId = 0;
    char eventName[20];

    sprintf_s(eventName, 20, "WaitSignal_%d", i);

    data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
        this, CREATE_SUSPENDED, &threadId);
    data.threadId = threadId;
    data.busy = false;
    data.waitSignal = CreateEvent(NULL, true, false, eventName);

    this->threads[threadId] = data;

    // start thread
    ResumeThread(data.handle);
}

// add job
void ThreadPool::addJob(int jobId, void* params) {
    // housekeeping
    EnterCriticalSection(&(this->mutex));

    // first, insert parameters in the list
    this->jobs.push_back(job);

    // then, find the first free thread and wake it
    for (it = this->threads.begin(); it != this->threads.end(); ++it) {
        thread = (threadData) it->second;

        if (!thread.busy) {
            this->threads[thread.threadId].busy = true;

            ++(this->nbActiveThreads);

            // wake thread such that it gets the next params and runs them
            SetEvent(thread.waitSignal);
            break;
        }
    }

    LeaveCriticalSection(&(this->mutex));
}

孤寂小茶 2024-08-09 08:27:29

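This looks to me like a producer-consumer pattern, which can be implemented with two semaphores: one guarding against queue overflow, the other guarding against an empty queue.

You can find some details here.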
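
As an illustration of the two-semaphore idea, here is a minimal sketch. It assumes standard C++ threads instead of the Win32 calls from the question, and the names Semaphore, BoundedQueue and sumThroughQueue are hypothetical, as is the use of a negative job as a stop marker. On Windows the same roles could be played by semaphore objects created with CreateSemaphore.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal counting semaphore built on a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(std::size_t initial) : count_(initial) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t count_;
};

// Bounded job queue: freeSlots_ blocks producers when the queue is full,
// usedSlots_ blocks consumers when it is empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity)
        : freeSlots_(capacity), usedSlots_(0) {}
    void push(int job) {
        freeSlots_.acquire();
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(job);
        }
        usedSlots_.release();
    }
    int pop() {
        usedSlots_.acquire();
        int job;
        {
            std::lock_guard<std::mutex> lock(m_);
            job = q_.front();
            q_.pop();
        }
        freeSlots_.release();
        return job;
    }
private:
    Semaphore freeSlots_;
    Semaphore usedSlots_;
    std::mutex m_;
    std::queue<int> q_;
};

// Hypothetical demo: the caller produces jobs 1..n, the consumers sum them.
// A negative job tells a consumer to stop (an assumption of this sketch).
long long sumThroughQueue(int n, int consumers) {
    BoundedQueue queue(8);
    std::vector<long long> partial(consumers, 0);
    std::vector<std::thread> workers;
    for (int c = 0; c < consumers; ++c) {
        workers.emplace_back([&queue, &partial, c] {
            for (;;) {
                int job = queue.pop();
                if (job < 0) break;
                partial[c] += job;
            }
        });
    }
    for (int i = 1; i <= n; ++i) queue.push(i);
    for (int c = 0; c < consumers; ++c) queue.push(-1);  // one stop marker each
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Calling sumThroughQueue(1000, 4) pushes a thousand tiny jobs through the queue. With jobs this small the per-job synchronization will probably still dominate, which is the effect the question describes.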

戴着白色围巾的女孩 2024-08-09 08:27:29

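Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.

One way to fix this is to bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.

Alternatively, instead of using signalling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.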

小巷里的女流氓 2024-08-09 08:27:29

Watch out, you are still asking for a next job after the endSignal is emitted.

for( ;; ) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if( waitResult - WAIT_OBJECT_0 != 0 )
        return 0;  // endSignal (or a failed wait): leave the thread function
    //....
}

深陷 2024-08-09 08:27:29

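Since you say that it is much slower in parallel than in sequential execution, I assume that the processing time for your inner 2500 loop iterations is tiny (in the range of a few microseconds). Then there is not much you can do except review your algorithm to split off larger chunks of processing; OpenMP won't help, and other synchronization techniques won't help either, because fundamentally they all rely on events (spin loops do not qualify).

On the other hand, if the processing time of the 2500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into cache cycling problems, where each of your top-level 1000 iterations flushes and reloads the caches of the 4 cores. Then there is no single solution, and depending on your target hardware, there may be none at all.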

殤城〤 2024-08-09 08:27:29

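If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 beta 1, I'd suggest looking at the Parallel Pattern Library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.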
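
To make the OpenMP suggestion concrete, here is a minimal sketch of the shape described in the question: the outer steps stay sequential and each inner loop is split across threads. The functions runStep and runAllSteps and the multiply-by-a-factor work are hypothetical placeholders. The pragma only takes effect when OpenMP is enabled in the compiler (for example /openmp in Visual C++ or -fopenmp in GCC); otherwise the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one "big" iteration: scale every element.
void runStep(std::vector<double>& values, double factor) {
    const int n = static_cast<int>(values.size());
    // The iterations are divided among the threads of the team. The loop
    // ends with an implicit barrier, so the next step cannot start early.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        values[i] *= factor;
    }
}

// The outer loop stays sequential; only the inner loop runs in parallel.
double runAllSteps(int steps, int n) {
    std::vector<double> values(n, 1.0);
    for (int s = 0; s < steps; ++s) {
        runStep(values, 2.0);
    }
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum;
}
```

Whether this beats the sequential version still depends on how much work each inner iteration does, as discussed in the rest of this answer.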

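As for your question about performance, it really depends on how much work you're scheduling during your iterations and what it costs. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work items are small, which is why I suggest using an implementation that is already built. You also need to make sure that you aren't running in debug mode, that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and that you aren't hitting false sharing.

I'd suggest looking at this under a profiler such as xperf (http://msdn.microsoft.com/en-us/performance/default.aspx), the F1 profiler in Visual Studio 2010 beta 1 (it has 2 new concurrency modes that help you see contention), or Intel's VTune.

You could also share the code that you're running in the tasks, so folks can get a better idea of what you're doing, because the answer I always get with performance questions is first "it depends" and second "have you profiled it?"

Good luck

-Rick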

━╋う一瞬間旳綻放 2024-08-09 08:27:29

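It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-running jobs, or for jobs that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, make sure you have only 1 thread per CPU.

There may be other problems too: how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.

I would optimize it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once to kick off the threads. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.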
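
A minimal sketch of that idea, assuming standard C++ threads rather than the Win32 pool from the question: the jobs live in one vector, each worker claims the next index through an atomic counter until none are left, and jobs are accessed by reference instead of being copied. The Job struct, processAll, sumOfSquares and the squaring work are hypothetical placeholders.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Job {        // hypothetical job record
    int input;
    int output;
};

// Every worker keeps claiming the next unprocessed job until the list is
// exhausted. The only per-job synchronization is one atomic increment.
void processAll(std::vector<Job>& jobs, unsigned workerCount) {
    std::atomic<std::size_t> next{0};
    auto worker = [&jobs, &next] {
        for (;;) {
            const std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= jobs.size()) return;        // nothing left to claim
            Job& job = jobs[i];                  // reference, not a copy
            job.output = job.input * job.input;  // placeholder work
        }
    };
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < workerCount; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();            // results are visible after join
}

// Hypothetical demo: square the numbers 1..n and add up the results.
long long sumOfSquares(int n, unsigned workerCount) {
    std::vector<Job> jobs;
    for (int i = 1; i <= n; ++i) jobs.push_back({i, 0});
    processAll(jobs, workerCount);
    long long total = 0;
    for (const Job& j : jobs) total += j.output;
    return total;
}
```

This sketch starts fresh threads on every call to keep it short; in a real pool the threads would be long-lived and woken once per batch.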

唐婉 2024-08-09 08:27:29

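Context switching between threads can be expensive too. In some cases it is worth developing a framework that can process your jobs either sequentially with one thread or with multiple threads. That way you get the best of both worlds.

By the way, what exactly is your question? I will be able to answer more precisely with a more precise question :)

EDIT:

The events part can, in some cases, consume more than your processing, but it should not be that expensive unless your processing is really fast. In that case, switching between threads is expensive too, hence the first part of my answer about doing things sequentially...

You should look for inter-thread synchronization bottlenecks. You can start by tracing thread waiting times...

EDIT: After more hints...

If I guess correctly, your problem is to use all of your computer's cores/processors efficiently to parallelize some processing that is essentially sequential.

Assume you have 4 cores and 10000 loops to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to finish before going on. Then you can simplify your synchronization process. You just need to give your four threads the nth, nth+1, nth+2 and nth+3 loops, wait for the four threads to complete, and then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task. The search for an available thread inside a critical section is what is eating your CPU time, not the event part.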
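
A minimal sketch of the barrier approach, assuming standard C++ threads and a hand-written reusable barrier (C++20 also offers std::barrier). The Barrier class, runSteps and the placeholder work are hypothetical. In this placeholder each thread only touches its own slice, so the barrier merely marks where the wait between "big" iterations goes; in real code the next step would read results written by the other threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: the last thread to arrive releases all the others.
class Barrier {
public:
    explicit Barrier(std::size_t parties)
        : parties_(parties), waiting_(0), generation_(0) {}
    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(m_);
        const std::size_t gen = generation_;
        if (++waiting_ == parties_) {
            waiting_ = 0;       // reset for the next round
            ++generation_;      // marks this round as finished
            cv_.notify_all();
        } else {
            cv_.wait(lock, [this, gen] { return generation_ != gen; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t parties_;
    std::size_t waiting_;
    std::size_t generation_;
};

// Runs `steps` sequential "big" iterations over n elements, each split into
// one contiguous slice per worker. Placeholder work: step s adds s + 1.
long long runSteps(int steps, std::size_t n, std::size_t workerCount) {
    std::vector<long long> data(n, 0);
    Barrier barrier(workerCount);
    auto worker = [&](std::size_t id) {
        const std::size_t begin = n * id / workerCount;
        const std::size_t end = n * (id + 1) / workerCount;
        for (int s = 0; s < steps; ++s) {
            for (std::size_t i = begin; i < end; ++i) data[i] += s + 1;
            barrier.arriveAndWait();  // nobody starts step s + 1 early
        }
    };
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < workerCount; ++id) {
        threads.emplace_back(worker, id);
    }
    for (auto& t : threads) t.join();
    long long total = 0;
    for (long long v : data) total += v;
    return total;
}
```

The threads are created once and stay alive for all steps, so the per-step cost is one barrier wait per thread rather than a search for a free thread under a critical section.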

请别遗忘我 2024-08-09 08:27:29

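"It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive!"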

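"Expensive" is a relative word. Are jets expensive? Are cars? Or bicycles... shoes...?

In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: how long does the process take when "unthreaded"? Is it months, or a few femtoseconds?

What happens to the time as you increase the thread pool size? Try a pool size of 1, then 2, then 4, etc.

Also, since you've had some problems with thread pools in the past, I'd suggest some debugging to count the number of times your thread function is actually invoked... does it match what you expect?

Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything "huge" in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly more than this, then your threads are likely to cost you time rather than save it.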
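
One rough way to get such absolute figures is sketched below, assuming standard C++ and std::chrono. The jobFunction here is a hypothetical stand-in for the real JobFunction, and averageMicros, inlineMicros and handoffMicros are names invented for this sketch. The hand-off version creates and joins a thread per job, which is costlier than waking an existing thread through an event, so it should be read as a loose upper bound on the per-job overhead.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the real job: sums the integers below `iterations`.
long long jobFunction(int iterations) {
    volatile long long acc = 0;  // volatile keeps the loop from being optimized away
    for (int i = 0; i < iterations; ++i) acc = acc + i;
    return acc;
}

// Average wall-clock time of one call to fn, in microseconds.
template <typename Fn>
double averageMicros(int repeats, Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / repeats;
}

// Cost of running the job directly in the calling thread.
double inlineMicros(int iterations, int repeats) {
    return averageMicros(repeats, [iterations] { jobFunction(iterations); });
}

// Cost of handing the same job to another thread and waiting for it.
double handoffMicros(int iterations, int repeats) {
    return averageMicros(repeats, [iterations] {
        std::thread t([iterations] { jobFunction(iterations); });
        t.join();
    });
}
```

Comparing inlineMicros(2500, 1000) with handoffMicros(2500, 1000) in a release build gives a first impression of whether a 2'500-iteration job is large enough to be worth handing off at all.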

想你只要分分秒秒 2024-08-09 08:27:29

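As mentioned before, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.

Your coding approach increases the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that, and doing it a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work, yet it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle processors. I would suggest reorganizing the code so that addJob() does what its name says, which is adding a job ONLY (without finding or even caring who does the job), while each worker thread actively gets new work when it is done with its existing work.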
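
The reorganization suggested above might look roughly like the following sketch. It assumes standard C++ (a mutex and condition variables) instead of Win32 events and critical sections; SimplePool, waitAll and countCompleted are hypothetical names, not part of the class in the question.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class SimplePool {
public:
    explicit SimplePool(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        workAvailable_.notify_all();
        for (auto& w : workers_) w.join();
    }
    // Only enqueues the job; it neither looks for nor cares who runs it.
    void addJob(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        workAvailable_.notify_one();
    }
    // Blocks until every job added so far has finished.
    void waitAll() {
        std::unique_lock<std::mutex> lock(m_);
        allDone_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    // Each worker fetches its own next job as soon as it is free.
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                workAvailable_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) allDone_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable workAvailable_;
    std::condition_variable allDone_;
    std::queue<std::function<void()>> jobs_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Hypothetical demo: run jobCount trivial jobs and count how many completed.
int countCompleted(int jobCount, unsigned workerCount) {
    std::atomic<int> completed{0};
    SimplePool pool(workerCount);
    for (int i = 0; i < jobCount; ++i) pool.addJob([&completed] { ++completed; });
    pool.waitAll();
    return completed.load();
}
```

For the question's workload, one would call addJob four times per "big" iteration and then waitAll() before starting the next one. Because addJob never fails to find a taker, the "all threads busy" case described above cannot silently drop work.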
