How to use Python multiprocessing to concatenate many files/DataFrames?

Posted 2025-02-02 18:02:03

I'm relatively new to Python and programming and just use it for the analysis of simulation data.
I have a directory "result_1/" with over 150,000 CSV files containing simulation data that I want to concatenate into one pandas DataFrame. To avoid problems with readdir() only returning 32K directory entries at a time, I prepared "files.csv", which lists all the files in the directory.

("sim", "det", and "run" are pieces of information I read from the filenames and insert as Series into the dataFrame. For better overlook, I took their definition out of the concat.)

My problem is as follows:
The program takes too long to run, and I would like to use multiprocessing/multithreading to speed up the for-loop, but since I have never used mp/mt before, I don't even know whether or how it can be used here.

Thank you in advance and have a great day!

import numpy as np
import pandas as pd
import os
import multiprocessing as mp

df = pd.DataFrame()
path = 'result_1/'
# Read the prepared list of files instead of scanning the directory.
list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()

for file in list_of_files:
    # Skip the 8 header lines and drop the unused y and z columns.
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    # sim, det and run are parsed from the filename, e.g. "193Nr6_Run_0038.csv".
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    dftemp = pd.concat([sim, det, run, dftemp], axis=1)
    df = pd.concat([df, dftemp], axis=0)

df.rename({0: 'sim', 1: 'det', 2: 'run', 3: 'x', 4: 'dos'}, axis=1).to_csv('df.csv')

The CSV files are named like "193Nr6_Run_0038.csv" (for example) and look like this:

#(8 lines of things I don't need.)
0, 0, 0, 4.621046656438921e-09
1, 0, 0, 4.600856584602298e-09
(... 300 lines of data [x, y, z, dose])
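
(For completeness: "files.csv" above was prepared separately. A minimal sketch of one way to generate it with os.scandir(), which iterates directory entries lazily; the output filename and the UTF-16-LE encoding are assumptions chosen to match what the script above reads.)

import os

# Write one CSV filename per line into 'files.csv', encoded the same way the
# analysis script reads it (utf_16_le). os.scandir() yields entries one by one.
with open('files.csv', 'w', encoding='utf_16_le') as out:
    with os.scandir('result_1/') as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith('.csv'):
                out.write(entry.name + '\n')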

Comments (1)

春花秋月 2025-02-09 18:02:03

Processing DataFrames in parallel can be difficult due to CPU and RAM limitations. I don't know the specs of your hardware nor the details of your DataFrames. However, I would use multiprocessing to "parse/make" the DataFrames, and then concatenate them afterwards. Here is an example:

import numpy as np
import pandas as pd
import os
from multiprocessing import Pool

path = 'result_1/'

# Make a function that replaces the body of the for-loop:
def my_custom_func(file):
    dftemp = pd.read_csv(os.path.join(path, file), skiprows=8, names=['x', 'y', 'z', 'dos'], sep=',').drop(['y', 'z'], axis=1)
    sim = pd.Series(int(file.split('Nr')[1].split('_')[0]) * np.ones((300,), dtype=int))
    det = pd.Series(int(file.split('Nr')[0]) * np.ones((300,), dtype=int))
    run = pd.Series(int(file[-8:-4]) * np.ones((300,), dtype=int))
    return pd.concat([sim, det, run, dftemp], axis=1)

# The __main__ guard is needed on platforms that spawn worker processes
# (Windows, recent macOS): without it, the Pool code would run again when
# each worker imports this module.
if __name__ == '__main__':
    list_of_files = pd.read_csv('files.csv', encoding='utf_16_le', names=['f'])['f'].values.tolist()

    # Use multiprocessing to process multiple files at once.
    with Pool(8) as p:  # 8 processes simultaneously; avoid using more processes than cores in your CPU
        dataframes = p.map(my_custom_func, list_of_files)

    # Finally, concatenate them all.
    df = pd.concat(dataframes)
    df.rename({0: 'sim', 1: 'det', 2: 'run', 3: 'x', 4: 'dos'}, axis=1).to_csv('df.csv')

Have a look at multiprocessing.Pool() for more info.
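
With roughly 150,000 tiny tasks, the per-task communication overhead of a plain map() call can matter. One knob worth knowing is the chunksize argument, which hands the file names to the workers in batches instead of one at a time. A minimal sketch, assuming the same my_custom_func and list_of_files as above (the value 256 is an arbitrary starting point, not a recommendation):

from multiprocessing import Pool
import pandas as pd

# Assumes my_custom_func and list_of_files are defined as in the example above.
if __name__ == '__main__':
    with Pool(8) as p:
        # chunksize=256 sends the file names to the workers in batches of 256,
        # so each worker handles many files per round trip; results still come
        # back in the original input order.
        dataframes = p.map(my_custom_func, list_of_files, chunksize=256)

    df = pd.concat(dataframes)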
