How to construct data from a list of files in Python using multiprocessing

Posted 2025-02-10 06:59:40


I am interested in speeding up my file read times by implementing multiprocessing, but I am having trouble getting data back from each process. The order does matter when all the data is put together, and I am using Python 3.9.

# imports needed for the snippet to run on its own
import os
import time
import numpy as np
from multiprocessing import Process
from PIL import Image

# read files from the file list at the given indices
def read_files(files, folder_path):
    raw_data = []
    # loop through the .tif files handed to this worker and parse the data
    for file in files:
        if file[-3:] == "tif":
            curr_frame = Image.open(os.path.join(folder_path, file))
            raw_data.append(np.array(curr_frame))
    # this return value is lost: Process does not hand results back to the parent
    return np.asarray(raw_data).astype(np.float64)


def run_processes(folder_path=None):
    if folder_path is None:
        global PATH
        folder_path = PATH
    files = os.listdir(folder_path)

    start = time.time()
    processes = []
    # split the file list into one roughly equal chunk per CPU
    # (files beyond the last full chunk are never assigned)
    num_files_per = int(len(files) / os.cpu_count())
    for i in range(os.cpu_count()):
        processes.append(Process(target=read_files,
                                 args=(files[(i * num_files_per):((i + 1) * num_files_per)], folder_path)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    end = time.time()
    print(f"Multi: {end - start}")

Any help is much appreciated!


Comments (1)

时光瘦了 2025-02-17 06:59:40


To potentially increase the speed, generate a list of file paths, and write a worker function that takes a single path as its argument and returns its data.
If you use that worker with a multiprocessing.Pool, it will take care of the details of returning the data for you.
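
A minimal sketch of that Pool-based approach, assuming all frames share the same shape and that sorting the directory listing gives the order you need; read_one, read_all, and the example folder path are illustrative names, not taken from the question:

import os
import numpy as np
from multiprocessing import Pool
from PIL import Image

def read_one(path):
    # worker: read a single .tif file and return it as a float64 array
    with Image.open(path) as img:
        return np.asarray(img, dtype=np.float64)

def read_all(folder_path):
    # build the full list of paths up front; Pool.map returns results in the
    # same order as this list, so the frame order is preserved
    paths = [os.path.join(folder_path, f)
             for f in sorted(os.listdir(folder_path))
             if f.endswith(".tif")]
    with Pool() as pool:
        frames = pool.map(read_one, paths)
    return np.asarray(frames)

if __name__ == "__main__":
    data = read_all("/path/to/frames")  # placeholder path
    print(data.shape, data.dtype)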

Keep in mind that you are trading the time to read a file for the overhead of returning the data to the parent process.
It is not a given that this is a net improvement.

And then there is the issue of file reads themselves. Since these files are presumably on the same device, you could run into the maximum throughput of the device here.

In general, if the processing you have to do on the images only depends on a single image, it could be worth it to do that processing in the worker, because that would speed things up.
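
As an illustration only (the per-frame mean subtraction below is a stand-in for whatever the real per-image work is), a worker that does its processing before returning, so only the finished result is sent back to the parent:

import numpy as np
from PIL import Image

def read_and_process_one(path):
    # read one frame in the child process
    with Image.open(path) as img:
        frame = np.asarray(img, dtype=np.float64)
    # do the per-image work here, so only the processed result crosses the
    # process boundary (the mean subtraction is just a placeholder)
    frame -= frame.mean()
    return frame

# drop-in replacement for read_one in the Pool sketch above:
#     frames = pool.map(read_and_process_one, paths)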
