What is the correct way to do parallel work in Python with files and a DB?
I have a very large number of file names (from my PC) inserted in a DB with the status "new" by default. For every file name I want to perform some operations (change the file). While a file is being changed, its status should change to "processing", and after the operations it should change to "processed".
I decided to do this with Python's multiprocessing module. Right now I have the solution below, but I think it is incorrect, because the function does not run exactly once per file.
import glob
import time
import multiprocessing as mp
import pymongo

myclient = pymongo.MongoClient(...)
mydb = myclient["file_list"]
mycol = mydb["file_list"]

def test_func(path_to_files):
    for file in glob.glob(path_to_files + "/*.jpg"):
        fileDB = mycol.find_one({'name': file})
        if fileDB.get('status') == 'new':
            query = {"name": file}
            processing = {"$set": {"status": "processing"}}
            mycol.update_one(query, processing)
            print('update', file)
            # ...operations with file...
            processed = {"$set": {"status": "processed"}}
            mycol.update_one(query, processed)
        else:
            continue

if __name__ == '__main__':
    start = time.time()
    processes = []
    num_processes = mp.cpu_count()
    for i in range(num_processes):
        # path_to_files: directory with the .jpg files, defined elsewhere
        process = mp.Process(target=test_func, args=(path_to_files,))
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    end = time.time()
    print(end - start)
My print('update', file) shows the same file in every process. I want to do this work in parallel to speed up my program and to mark the files that have already been processed.
Please tell me what I am doing wrong. Is this the correct way to do what I want, or can I do it in a different way?
I would be happy with any suggestion.
I am new to Python.
Comments (1)
All your processes are doing the same work.
If you have 10 files, all of your processes run over the same 10 files; your lock (setting the status to "processing") is too slow, and by the time the first process has set the status to "processing", the next process has already passed the "new" check.
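As an aside (not what the suggestion below relies on), the window described above exists because find_one and the following update_one are two separate operations. Below is a minimal sketch of how the check and the status change could be collapsed into one atomic step with pymongo's find_one_and_update, reusing the collection and field names from the question; the connection string and the function name claim_next_file are placeholders of mine:

import pymongo

myclient = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
mycol = myclient["file_list"]["file_list"]

def claim_next_file():
    # Match a document whose status is still "new" and flip it to "processing"
    # in a single server-side operation, so two workers can never claim the
    # same file; returns None once nothing is left to claim.
    return mycol.find_one_and_update(
        {"status": "new"},
        {"$set": {"status": "processing"}},
    )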
Look into splitting the files up between the processes instead: if you have 100 files and 5 processes, process 1 handles files 1-20, process 2 handles files 21-40, and so on (a sketch follows).
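Here is a minimal sketch of that splitting idea, reusing the directory glob and the collection names from the question; the worker function process_chunk, the round-robin chunking, the connection string and the path are my placeholders. Each worker gets its own slice of the file list instead of re-globbing the whole directory, so no file is picked up twice:

import glob
import multiprocessing as mp
import pymongo

def process_chunk(files):
    # Each process opens its own client; pymongo clients should not be shared
    # across a fork.
    myclient = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
    mycol = myclient["file_list"]["file_list"]
    for file in files:
        mycol.update_one({"name": file}, {"$set": {"status": "processing"}})
        # ...operations with file...
        mycol.update_one({"name": file}, {"$set": {"status": "processed"}})

if __name__ == '__main__':
    path_to_files = "/path/to/files"  # placeholder
    all_files = glob.glob(path_to_files + "/*.jpg")
    num_processes = mp.cpu_count()
    # Deal the files out round-robin so every process gets a distinct,
    # roughly equal share.
    chunks = [all_files[i::num_processes] for i in range(num_processes)]
    processes = [mp.Process(target=process_chunk, args=(chunk,))
                 for chunk in chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

With the list split up front, the "new"/"processing" statuses are no longer what keeps the processes from colliding; they only record progress in the DB.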