Python C extension: multithreading and random numbers
I have implemented a work queue pattern in C (within a Python extension) and I am disappointed with its performance.
I have a simulation with a list of particles ("elements"), and I benchmark the time taken to perform all the calculations required for a timestep, recording this along with the number of particles involved. I am running the code on a quad-core hyperthreaded i7, so I was expecting performance to rise (time taken to fall) with the number of threads, up to about 8. Instead, the fastest implementation has no worker threads (functions are simply executed rather than added to the queue), and the code gets slower with each added worker thread (by a step of more than the whole unthreaded time per new thread!). A quick look at my processor usage application suggests Python never really exceeds 130% CPU usage, regardless of how many threads are running. The machine has plenty of headroom above that, with overall system usage at about 200%.
Now, part of my queue implementation (shown below) is choosing an item at random from the queue, since each work item's execution requires a lock on two elements, and similar elements will be near each other in the queue. I therefore want the threads to pick random indices and attack different parts of the queue, to minimise mutex clashes.
I've read that my initial attempt with rand() will have been slow because my random numbers weren't thread-safe (does that sentence make sense? not sure...). I've tried the implementation both with random() and with drand48_r (although, unfortunately, the latter seems to be unavailable on OS X), to no avail.
Perhaps someone can tell me what the cause of the problem might be? The code (worker function) is below; do shout if you think any of the queue_add functions or constructors might be useful to see too.
void* worker_thread_function(void* untyped_queue) {
    queue_t* queue = (queue_t*)untyped_queue;
    int success = 0;
    int rand_id = 0;
    work_item_t* work_to_do = NULL;
    int work_items_completed = 0;

    while (1) {
        if (pthread_mutex_lock(queue->mutex)) {
            // error case, try again:
            continue;
        }
        while (!success) {
            if (queue->queue->count == 0) {
                pthread_mutex_unlock(queue->mutex);
                break;
            }
            // choose a random item from the work queue, in order to avoid
            // clashing element mutexes.
            rand_id = random() % queue->queue->count;
            if (!pthread_mutex_trylock(((work_item_t*)queue->queue->items[rand_id])->mutex)) {
                // obtain mutex locks on both elements for the work item.
                work_to_do = (work_item_t*)queue->queue->items[rand_id];
                if (!pthread_mutex_trylock(((element_t*)work_to_do->element_1)->mutex)) {
                    if (!pthread_mutex_trylock(((element_t*)work_to_do->element_2)->mutex)) {
                        success = 1;
                    } else {
                        // only locked element_1 and the work item:
                        pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
                        pthread_mutex_unlock(work_to_do->mutex);
                        work_to_do = NULL;
                    }
                } else {
                    // couldn't lock element_1, didn't even try element_2:
                    pthread_mutex_unlock(work_to_do->mutex);
                    work_to_do = NULL;
                }
            }
        }
        if (work_to_do == NULL) {
            // the queue emptied while we were trying; its mutex was already
            // unlocked at the top of the inner loop.
            if (queue->queue->count == 0 && queue->exit_flag) {
                break;
            } else {
                continue;
            }
        }
        queue_remove_work_item(queue, rand_id, NULL, 1);
        pthread_mutex_unlock(work_to_do->mutex);
        pthread_mutex_unlock(queue->mutex);
        // At this point, we have mutex locks for the two elements in question,
        // and a work item no longer visible to any other threads. We have also
        // unlocked the main shared queue, and are free to perform the work on
        // the elements.
        execute_function(
            work_to_do->interaction_function,
            (element_t*)work_to_do->element_1,
            (element_t*)work_to_do->element_2,
            (simulation_parameters_t*)work_to_do->params
        );
        // now finished, we should unlock both elements:
        pthread_mutex_unlock(((element_t*)work_to_do->element_1)->mutex);
        pthread_mutex_unlock(((element_t*)work_to_do->element_2)->mutex);
        // and release the work_item RAM:
        work_item_destroy((void*)work_to_do);
        work_to_do = NULL;
        work_items_completed++;
        success = 0;
    }
    return NULL;
}
3 Answers
It doesn't seem like random() is your problem, since it is the same code regardless of the number of threads. Since performance goes down as the number of threads rises, you are likely getting killed by locking overhead. Do you really need multiple threads? How long does the work function take, and what is your average queue depth? Selecting items randomly seems like a bad idea. Certainly if the queue count is <= 2 you don't need to do the rand calculation at all. Also, instead of randomly selecting a queue index, it would be better to use a different queue per worker thread and insert in round-robin fashion. Or, at least, do something simple like remembering the last index claimed by the previous thread and not picking that one.
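The per-thread-queue idea above could be sketched roughly as follows. This is a minimal illustration, not the poster's actual queue_t: simple_queue_t, queue_add_round_robin, and NUM_WORKERS are invented for the sketch, and the round-robin cursor assumes a single producer thread.

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_WORKERS 4

/* Minimal per-worker queue; each worker drains only its own queue, so
   workers almost never contend on the same queue mutex. */
typedef struct {
    pthread_mutex_t mutex;
    void** items;
    int count;
    int capacity;
} simple_queue_t;

static simple_queue_t queues[NUM_WORKERS];
static int next_queue = 0;  /* producer-side round-robin cursor */

void simple_queue_init(simple_queue_t* q, int capacity) {
    pthread_mutex_init(&q->mutex, NULL);
    q->items = malloc(sizeof(void*) * capacity);
    q->count = 0;
    q->capacity = capacity;
}

/* Insert work items round-robin across the per-worker queues.
   Assumes a single producer, so next_queue needs no lock of its own. */
void queue_add_round_robin(void* item) {
    simple_queue_t* q = &queues[next_queue];
    next_queue = (next_queue + 1) % NUM_WORKERS;
    pthread_mutex_lock(&q->mutex);
    q->items[q->count++] = item;
    pthread_mutex_unlock(&q->mutex);
}
```

With this layout each worker only locks its own queue briefly to pop the next item, instead of every thread spinning on one shared queue mutex.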
Python threads are real OS-level threads, but thanks to the GIL (the Global Interpreter Lock) only one of them can execute Python bytecode at a time. Rewriting your code with processes might do the trick, if the workers are relatively long-lived in the context.
Wikipedia's page on GIL
----Edit----
Right, this was in C, but the GIL still matters: C extension code holds the GIL unless it explicitly releases it. See the Python documentation on threads in C extensions.
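For a C extension this usually means wrapping the pure-C computation in the CPython macros that release and reacquire the GIL. The fragment below is a hypothetical sketch: run_timestep and do_timestep are stand-ins for the poster's actual entry point and simulation call, and it assumes the standard CPython C API.

```c
#include <Python.h>

extern void do_timestep(void);  /* hypothetical: the poster's pure-C work */

static PyObject* run_timestep(PyObject* self, PyObject* args) {
    /* While the GIL is released, this extension's pthreads can truly run in
       parallel with the interpreter, but no Python API calls (no PyObject
       access) are allowed inside the released region. */
    Py_BEGIN_ALLOW_THREADS
    do_timestep();
    Py_END_ALLOW_THREADS
    Py_RETURN_NONE;
}
```

If the worker threads are pure C and never touch Python objects, releasing the GIL around the timestep is what lets CPU usage climb past one core's worth.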
To know if this is the bottleneck of your program, you'd have to benchmark and check, but it might well be possible.
random() and friends that have a hidden state variable can be severe bottlenecks for parallel programming. If they are made thread-safe, this is usually done by simply mutexing the access, so everything slows down.
The portable choice for a thread-safe random generator on POSIX systems is erand48. In contrast to drand48, it receives the state variable as an argument. You'd just have to keep a state variable on the stack of each thread (it is an unsigned short[3]) and call erand48 with that.
Also keep in mind that these are pseudo-random generators. If you use the same state variable between different threads, your random numbers are not independent.