Python threading with a Pandas DataFrame does not improve performance
I have a DataFrame of 200k rows. I want to split it into parts and call my function S_Function on each partition.
```python
def S_Function(df):
    # my code here
    return new_df
```
Main program:
```python
N_Threads = 10
Threads = []
Out = []
size = df.shape[0] // N_Threads
for i in range(N_Threads + 1):
    begin = i * size
    end = min(df.shape[0], (i + 1) * size)
    # args must be a tuple; join() returns None, so each thread
    # appends its own result to the shared Out list instead
    Threads.append(Thread(target=lambda part: Out.append(S_Function(part)),
                          args=(df.iloc[begin:end],)))
```
I run the threads and join them:
```python
for i in range(N_Threads + 1):
    Threads[i].start()
for i in range(N_Threads + 1):
    # join() only waits for the thread; the results are gathered in Out
    Threads[i].join()
output = pd.concat(Out)
```
The code runs, but the problem is that using threading.Thread did not decrease the execution time:

Sequential code: 16 minutes
Parallel code: 15 minutes

Can someone explain what to improve, and why this is not working well?
1 Answer
Don't use threading when you have to process CPU-bound operations: CPython's Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so CPU-bound threads end up running one after another, which is why you saw essentially no speedup. To achieve your goal, you should use the multiprocessing module. Try: