python c 扩展：多线程和随机数

发布于 2024-12-18 12:41:29 字数 3431 浏览 0 评论 0原文

我已经用 C 实现了工作队列模式（在 python 扩展中），但我对性能感到失望。

我有一个包含粒子（“元素”）列表的模拟，并且对执行时间步所需的所有计算所需的时间进行基准测试，并记录该时间以及所涉及的粒子数量。我在四核超线程 i7 上运行代码，因此我预计性能会随着线程数量达到约 8 个而上升（所需时间下降），但最快的实现没有工作线程（函数只是执行而不是添加到队列中，）并且随着每个工作线程，代码变得越来越慢（比每个新线程的无线程实现的时间多一步！）我快速浏览了我的处理器使用情况应用程序，似乎 python 从未真正超过无论运行多少线程，CPU 使用率均为 130%。该机器还有足够的空间，总体系统利用率约为 200%。

现在，我的队列实现的一部分（如下所示）是从队列中随机选择一个项目，因为每个工作项目的执行都需要锁定两个元素，并且相似的元素将在队列中彼此靠近。因此，我希望线程选择随机索引并攻击队列的不同位，以最大限度地减少互斥冲突。

现在，我读到我最初使用 rand() 的尝试会很慢，因为我的随机数不是线程安全的（这句话有意义吗？不确定......

）尝试使用 random() 和 drand48_r 实现（尽管不幸的是，后者似乎在 OS X 上不可用），但统计数据无济于事。

也许其他人可以告诉我问题的原因是什么？代码（工作函数）如下，如果您认为任何queue_add函数或构造函数也可能有用，请大声喊叫。

void* worker_thread_function(void* untyped_queue) {

  queue_t* queue = (queue_t*)untyped_queue;
  int success = 0;
  int rand_id;
  long int temp;
  work_item_t* work_to_do = NULL;
  int work_items_completed = 0;

  while (1) {
    if (pthread_mutex_lock(queue->mutex)) {

      // error case, try again:
      continue;
    }

    while (!success) {

      if (queue->queue->count == 0) {

        pthread_mutex_unlock(queue->mutex);
        break;
      }

      // choose a random item from the work queue, in order to avoid clashing element mutexes.
      rand_id = random() % queue->queue->count;

      if (!pthread_mutex_trylock(((work_item_t*)queue->queue->items[rand_id])->mutex)) {

        // obtain mutex locks on both elements for the work item.
        work_to_do = (work_item_t*)queue->queue->items[rand_id];

        if (!pthread_mutex_trylock(((element_t*)work_to_do->element_1)->mutex)){ 
          if (!pthread_mutex_trylock(((element_t*)work_to_do->element_2)->mutex)) {

            success = 1;
          } else {

            // only locked element_1 and work item:
            pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
            pthread_mutex_unlock(work_to_do->mutex);
            work_to_do = NULL;
          }
        } else {

          // couldn't lock element_1, didn't even try 2:
          pthread_mutex_unlock(work_to_do->mutex);
          work_to_do = NULL;
        }
      }
    }

    if (work_to_do == NULL) {
       if (queue->queue->count == 0 && queue->exit_flag) {

        break;
      } else {

        continue;
      }
    }

    queue_remove_work_item(queue, rand_id, NULL, 1);
    pthread_mutex_unlock(work_to_do->mutex);

    pthread_mutex_unlock(queue->mutex);

    // At this point, we have mutex locks for the two elements in question, and a
    // work item no longer visible to any other threads. we have also unlocked the main
    // shared queue, and are free to perform the work on the elements.
    execute_function(
      work_to_do->interaction_function,
      (element_t*)work_to_do->element_1,
      (element_t*)work_to_do->element_2,
      (simulation_parameters_t*)work_to_do->params
    );

    // now finished, we should unlock both the elements:
    pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
    pthread_mutex_unlock(((element_t*)work_to_do->element_2)->mutex);

    // and release the work_item RAM:
    work_item_destroy((void*)work_to_do);
    work_to_do = NULL;

    work_items_completed++;
    success = 0;
  }
  return NULL;
}

原文

I have implemented a work queue pattern in C (within a python extension) and I am disappointed with performance.

I have a simulation with a list of particles ("elements"), and I benchmark the time taken to perform all the calculations required for a timestep and record this along with the number of particles involved. I am running the code on a quad-core hyperthreaded i7, so I was expecting for performance to rise (time taken to fall) with the number of threads up to about 8, but instead the fastest implementation has no worker threads (functions are simply executed instead of added to the queue,) and with each worker thread the code gets slower and slower (by a step of more than the time for the unthreaded implementation for each new thread!) I've had a quick peek in my processor usage application, and it seems python never really exceeds 130% CPU usage, regardless of how many threads are running. The machine has plenty of headroom above that, overall system usage at about 200%.

Now part of my queue implementation (shown below) is choosing an item at random from the queue, since each work item's execution requires a lock on two elements and similar elements will be near each other in the queue. Thus, I want the threads to pick random indices and attack different bits of the queue to minimise mutex clashes.

Now, I've read that my initial attempt with rand() will have been slow because my random numbers weren't thread safe (does that sentence make sense? not sure...)

I've tried the implementation both with random() and with drand48_r (although, unfortunately, the latter seems to be unavailable on OS X,) to no avail with the statistics.

Perhaps someone else can tell me what might be the cause of the problem? the code (worker function) is below, and do shout if you think any of the queue_add functions or constructors might be useful to see too.

void* worker_thread_function(void* untyped_queue) {

  queue_t* queue = (queue_t*)untyped_queue;
  int success = 0;
  int rand_id;
  long int temp;
  work_item_t* work_to_do = NULL;
  int work_items_completed = 0;

  while (1) {
    if (pthread_mutex_lock(queue->mutex)) {

      // error case, try again:
      continue;
    }

    while (!success) {

      if (queue->queue->count == 0) {

        pthread_mutex_unlock(queue->mutex);
        break;
      }

      // choose a random item from the work queue, in order to avoid clashing element mutexes.
      rand_id = random() % queue->queue->count;

      if (!pthread_mutex_trylock(((work_item_t*)queue->queue->items[rand_id])->mutex)) {

        // obtain mutex locks on both elements for the work item.
        work_to_do = (work_item_t*)queue->queue->items[rand_id];

        if (!pthread_mutex_trylock(((element_t*)work_to_do->element_1)->mutex)){ 
          if (!pthread_mutex_trylock(((element_t*)work_to_do->element_2)->mutex)) {

            success = 1;
          } else {

            // only locked element_1 and work item:
            pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
            pthread_mutex_unlock(work_to_do->mutex);
            work_to_do = NULL;
          }
        } else {

          // couldn't lock element_1, didn't even try 2:
          pthread_mutex_unlock(work_to_do->mutex);
          work_to_do = NULL;
        }
      }
    }

    if (work_to_do == NULL) {
       if (queue->queue->count == 0 && queue->exit_flag) {

        break;
      } else {

        continue;
      }
    }

    queue_remove_work_item(queue, rand_id, NULL, 1);
    pthread_mutex_unlock(work_to_do->mutex);

    pthread_mutex_unlock(queue->mutex);

    // At this point, we have mutex locks for the two elements in question, and a
    // work item no longer visible to any other threads. we have also unlocked the main
    // shared queue, and are free to perform the work on the elements.
    execute_function(
      work_to_do->interaction_function,
      (element_t*)work_to_do->element_1,
      (element_t*)work_to_do->element_2,
      (simulation_parameters_t*)work_to_do->params
    );

    // now finished, we should unlock both the elements:
    pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
    pthread_mutex_unlock(((element_t*)work_to_do->element_2)->mutex);

    // and release the work_item RAM:
    work_item_destroy((void*)work_to_do);
    work_to_do = NULL;

    work_items_completed++;
    success = 0;
  }
  return NULL;
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

爱格式化 2024-12-25 12:41:29

random() 似乎不是您的问题，因为无论线程数量如何，它都是相同的代码。由于性能会随着线程数量的增加而下降，因此您可能会因锁定开销而被杀死。你真的需要多线程吗？工作函数需要多长时间？您的平均队列深度是多少？随机选择项目似乎是个坏主意。当然，如果队列计数 <= 2，则不需要进行 rand 计算。此外，最好为每个工作线程使用不同的队列并以循环方式插入，而不是随机选择队列索引。或者，至少做一些简单的事情，比如记住上一个线程声明的最后一个索引，然后不选择那个索引。

回复收藏 0 原文