What is the correct way to do parallel work in Python with files and a DB?
I have a very large number of file names (from my PC) inserted in a DB with the status "new" by default. For every file name I want to perform some operations (change the file). While a file is being changed, its status should change to "processing", and after the operations it should change to "processed".
I decided to do this with Python's multiprocessing module. Right now I have the solution below, but I think it is incorrect, because the function does not run exactly once per file.
import glob
import time
import multiprocessing as mp
import pymongo

myclient = pymongo.MongoClient(...)
mydb = myclient["file_list"]
mycol = mydb["file_list"]

def test_func(path_to_files):
    for file in glob.glob(path_to_files + "/*.jpg"):
        fileDB = mycol.find_one({'name': file})
        if fileDB.get('status') == 'new':
            query = {"name": file}
            processing = {"$set": {"status": "processing"}}
            mycol.update_one(query, processing)
            print('update', file)
            # ...operations with file...
            processed = {"$set": {"status": "processed"}}
            mycol.update_one(query, processed)
        else:
            continue

if __name__ == '__main__':
    start = time.time()
    processes = []
    num_processes = mp.cpu_count()
    for i in range(num_processes):
        # path_to_files: directory with the .jpg files, defined elsewhere
        process = mp.Process(target=test_func, args=(path_to_files,))
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    end = time.time()
    print(end - start)
My print('update', file) shows the same file in every process. I want to do this work in parallel to speed up my program and to mark the files that have already been processed.
Please tell me what I am doing wrong. Is this the correct way to do what I want, or can I do it in a different way?
I would be happy with any suggestion.
I am new to Python.
Comments (1)
All your processes are doing the same work.
If you have 10 files, all of your processes run over the same 10 files; your lock (setting the status to "processing") is too slow, and by the time the first process has set the status to "processing", the next process has already passed the "new" check.
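As an aside (not what the suggestion below relies on), the window described above exists because find_one and the following update_one are two separate operations. Below is a minimal sketch of how the check and the status change could be collapsed into one atomic step with pymongo's find_one_and_update, reusing the collection and field names from the question; the connection string and the function name claim_next_file are placeholders of mine:

import pymongo

myclient = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
mycol = myclient["file_list"]["file_list"]

def claim_next_file():
    # Match a document whose status is still "new" and flip it to "processing"
    # in a single server-side operation, so two workers can never claim the
    # same file; returns None once nothing is left to claim.
    return mycol.find_one_and_update(
        {"status": "new"},
        {"$set": {"status": "processing"}},
    )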
Look into splitting the files up between the processes instead: if you have 100 files and 5 processes, process 1 handles files 1-20, process 2 handles files 21-40, and so on (a sketch follows).
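Here is a minimal sketch of that splitting idea, reusing the directory glob and the collection names from the question; the worker function process_chunk, the round-robin chunking, the connection string and the path are my placeholders. Each worker gets its own slice of the file list instead of re-globbing the whole directory, so no file is picked up twice:

import glob
import multiprocessing as mp
import pymongo

def process_chunk(files):
    # Each process opens its own client; pymongo clients should not be shared
    # across a fork.
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
    mycol = myclient["file_list"]["file_list"]
    for file in files:
        mycol.update_one({"name": file}, {"$set": {"status": "processing"}})
        # ...operations with file...
        mycol.update_one({"name": file}, {"$set": {"status": "processed"}})

if __name__ == '__main__':
    path_to_files = "/path/to/files"  # placeholder
    all_files = glob.glob(path_to_files + "/*.jpg")
    num_processes = mp.cpu_count()
    # Deal the files out round-robin so every process gets a distinct,
    # roughly equal share.
    chunks = [all_files[i::num_processes] for i in range(num_processes)]
    processes = [mp.Process(target=process_chunk, args=(chunk,))
                 for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

With the list split up front, the "new"/"processing" statuses are no longer what keeps the processes from colliding; they only record progress in the DB.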