Multithreading in ArcGIS with Python
I have a Python script that works great when run by itself. Based on a hardcoded input directory, it scans for all .mdb files, puts them into a list, and then iterates through them all in a for loop. Each iteration involves multiple table restrictions, joins, queries, and more.
The only problem: it takes about 36 hours to run on the input dataset. While this script will only ever be used for this dataset, I would like to improve its performance, since I often edit field selections, results to include, join methods, etc. I would like to say it takes a long time because my script is inefficient, but any inefficiency would be small, since nearly ALL of the processing time is spent inside the geoprocessor object.
All I have of relevance in my main script is:
indir = "D:\\basil\\input"
mdblist = createDeepMdbList(indir)
for infile in mdblist:
    processMdb(infile)
It also runs flawlessly when executed sequentially.
I have tried using Parallel Python:
import pp

ppservers = ()
job_server = pp.Server(ppservers=ppservers)

inputs = tuple(mdblist)
functions = (preparePointLayer, prepareInterTable, jointInterToPoint,
             prepareDataTable, exportElemTables, joinDatatoPoint, exportToShapefile)
modules = ("sys", "os", "arcgisscripting", "string", "time")

# one template per worker function; submit one job per .mdb file
fn = pp.Template(job_server, processMdb, functions, modules)
jobs = [(input, fn.submit(input)) for input in inputs]
It succeeds in creating 8 processes and 8 geoprocessor objects... and then fails.
I have not experimented extensively with the built-in Python multithreading tools, but was hoping for some guidance to simply spawn up to 8 processes working through the queue represented by mdblist. At no point would any file be written or read by multiple processes at the same time. To keep things simpler for now, I have also removed all my logging tools because of this concern; I have run this script enough times to know that it works, except for 4 of the 4104 input files that have slightly different data formats.
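Roughly what I have in mind, as an untested sketch using the stdlib multiprocessing module (processMdb and createDeepMdbList as above; the stop-marker pattern is my assumption, not working code):

import multiprocessing

def worker(queue):
    # each process pulls .mdb paths until it sees the stop marker
    while True:
        infile = queue.get()
        if infile is None:
            break
        processMdb(infile)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for i in range(8)]
    for p in workers:
        p.start()
    mdblist = createDeepMdbList("D:\\basil\\input")
    for infile in mdblist:
        queue.put(infile)
    for p in workers:
        queue.put(None)    # one stop marker per worker
    for p in workers:
        p.join()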
Advice? Any wisdom on multithreading Arc Python scripts?
Comments (2)
Thought I'd share what ended up working for me and my experiences.
Using the backport of the multiprocessing module (code.google.com/p/python-multiprocessing), as per Joe's comment, worked well. I had to change a couple of things in my script to deal with local/global variables and logging.
Main script is now structured as follows (shown here as a minimal sketch, assuming the backport exposes the standard multiprocessing.Pool API and the ArcGIS 9.3-style arcgisscripting.create call):
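import multiprocessing     # the backport installs under the same name
import arcgisscripting

def processMdb(infile):
    # each call creates (and finally deletes) its own geoprocessor,
    # since gp objects cannot be shared between processes
    gp = arcgisscripting.create(9.3)
    try:
        # ... table restrictions, joins, queries, exports on infile ...
        pass
    finally:
        del gp

if __name__ == '__main__':
    mdblist = createDeepMdbList("D:\\basil\\input")
    pool = multiprocessing.Pool(processes=6)    # 6 worker processes
    pool.map(processMdb, mdblist)
    pool.close()
    pool.join()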
Total run time went from ~36 hours to ~8 hours using 6 processes.
Some issues I encountered: because separate processes have separate memory spaces, global variables are not shared at all. Queues can be used to pass data between processes, but I have not implemented this, so everything is just declared locally.
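For illustration, a tiny sketch of passing a value back through a Queue (not part of my script; the result string is hypothetical):

import multiprocessing

def worker(out_queue):
    # results must be sent back explicitly; mutating a global in the
    # child process is invisible to the parent
    out_queue.put("finished one .mdb")

if __name__ == '__main__':
    out_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(out_queue,))
    p.start()
    print(out_queue.get())
    p.join()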
Furthermore, since pool.map can only take one argument, each iteration must create and then delete the geoprocessor object, rather than being able to create 8 gp's and pass an available one to each iteration. Each iteration takes about a minute, so the couple of seconds spent creating it is not a big deal, but it adds up. I have not done any concrete tests, but this may actually be good practice: anyone who has worked with ArcGIS and Python will know that scripts slow down drastically the longer the geoprocessor is active. (For example, one of my scripts was used by a co-worker who overloaded the input; the estimated time to completion went from 50 hours after 1 hour of run time, to 350 hours after running overnight, to 800 hours after running for 2 days... it got cancelled and the input restricted.)
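If you do need more than one argument per task, a common workaround (not what I did; outdir here is a hypothetical extra parameter) is to pack the arguments into a tuple and unpack them in a small module-level wrapper, since pool.map workers must be picklable:

def processMdbArgs(args):
    infile, outdir = args          # unpack the single tuple argument
    return processMdb(infile, outdir)

tasks = [(infile, "D:\\basil\\output") for infile in mdblist]
pool.map(processMdbArgs, tasks)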
Hope that helps anyone else looking to multiprocess a large iterable input :). Next step: recursive, multiprocessed appends!
I compared the above methods in the same function.
The result: