I just started looking into parallelizing the analysis of ROOT files, i.e. trees. I have worked with RDataFrames, where implicit multithreading can be enabled with one line of code (EnableImplicitMT()). This works quite well.
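For reference, the implicit version is just this (the tree, file, and branch names here are made up):

```python
import ROOT

ROOT.EnableImplicitMT()  # the one line: RDataFrame now runs multithreaded

df = ROOT.RDataFrame("tree", "large.root")
h = df.Filter("pt > 20").Histo1D("pt")
h.Draw()  # the event loop runs here, spread across threads
```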
Now I want to experiment with explicit multiprocessing and uproot to see if efficiency can be boosted even further. I just need some guidance on a sensible approach.
Say I have a really large dataset (it cannot be read in at once) stored in a ROOT file with a couple of branches. Nothing too crazy has to be done for the analysis: some calculations, filtering, and then filling some histograms.
The ideas I have:
1. Trivial parallelization: somehow split the ROOT file into many smaller files and run the same analysis in parallel on all of them. At the end, recombine the respective results.
2. Read the file in and analyze it in batches, as described in the uproot docs, but distribute the batches and the operations on them to different cores, e.g. with the Python multiprocessing package (see the sketch after this list).
3. Similar to 2: read the file in batches, but rather than distributing whole batches to the cores, slice up the arrays of one batch and distribute the slices and the operations on them to the cores.
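Here is a rough sketch of what I mean by idea 2, assuming hypothetical branch names px and py in a file large.root containing a tree named tree, with a simple pt histogram as the per-batch operation:

```python
import multiprocessing as mp

import numpy as np
import uproot

# Hypothetical per-batch analysis: compute pt, filter, histogram.
def analyze(batch):
    pt = np.hypot(batch["px"], batch["py"])
    counts, _ = np.histogram(pt[pt > 20.0], bins=50, range=(0.0, 200.0))
    return counts

if __name__ == "__main__":
    total = np.zeros(50, dtype=np.int64)
    batches = uproot.iterate(
        "large.root:tree",    # hypothetical file and tree names
        ["px", "py"],
        step_size="100 MB",   # read the file in ~100 MB batches
        library="np",         # plain NumPy dicts pickle cheaply
    )
    with mp.Pool(4) as pool:
        # imap consumes the batch generator lazily, keeping memory bounded.
        for counts in pool.imap(analyze, batches):
            total += counts
```

One caveat: each batch has to be pickled and shipped to a worker process, so this only pays off when the per-batch computation is expensive compared to that serialization cost.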
I would appreciate some feedback on whether these approaches are worth trying, or whether there are better ways of handling large files efficiently.
A key thing to keep in mind about Uproot is that it isn't a framework for doing HEP analysis; it only reads ROOT files. The HEP analysis is the next step, code beyond any interactions with Uproot.
For the record, Uproot's file-reading can be parallelized, but that just means multiple threads can be employed to wait for disk/network and to decompress data. The effect is the same: you wait for all the threads to be done to get the chunk of data, maybe a little faster. That's not what you're asking about.
You want your analysis code to run in parallel, and that's a generic question about parallel processing in Python, not about Uproot. You can break your work up into pieces (explicitly) and have each of those pieces independently use Uproot to read the data. Or you can use a Python library for parallel processing to do it implicitly, such as Dask, or a HEP-specific library that pulls these parts together, such as Coffea.
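To make the "explicit pieces" option concrete, here is a minimal sketch of splitting a tree by entry ranges, again assuming a hypothetical file large.root with a tree named tree and branches px and py. Each worker process opens the file itself with Uproot, so only the (start, stop) pairs and the resulting histogram counts cross process boundaries:

```python
import multiprocessing as mp

import numpy as np
import uproot

FILE = "large.root"  # hypothetical file name
TREE = "tree"        # hypothetical tree name

def analyze_range(entry_range):
    """Read one entry range, compute pt, filter, and histogram it."""
    start, stop = entry_range
    # Each worker opens the file independently; no large arrays
    # are ever pickled between processes.
    with uproot.open(FILE) as f:
        arrays = f[TREE].arrays(
            ["px", "py"], entry_start=start, entry_stop=stop, library="np"
        )
    pt = np.hypot(arrays["px"], arrays["py"])
    counts, _ = np.histogram(pt[pt > 20.0], bins=50, range=(0.0, 200.0))
    return counts

if __name__ == "__main__":
    with uproot.open(FILE) as f:
        num_entries = f[TREE].num_entries

    n_workers = 4
    edges = np.linspace(0, num_entries, n_workers + 1, dtype=int)
    ranges = list(zip(edges[:-1], edges[1:]))

    with mp.Pool(n_workers) as pool:
        # Recombine by summing the per-range histogram counts.
        total = sum(pool.map(analyze_range, ranges))
    print(total)
```

This partition-schedule-merge pattern is essentially what Dask and Coffea automate for you, including scaling it beyond a single machine.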