Parallel processing and temporary files
I'm using the mclapply function in the multicore package to do parallel processing. It seems that all the child processes it starts produce the same names for temporary files given by the tempfile function, i.e. if I have four processors,
library(multicore)
mclapply(1:4, function(x) tempfile())
will give four identical filenames. Obviously I need the temporary files to be different so that the child processes don't overwrite each other's files. When using tempfile indirectly, i.e. calling some function that calls tempfile, I have no control over the filename.
Is there a way around this? Do other parallel processing packages for R (e.g. foreach) have the same problem?
Update: This is no longer an issue since R 2.14.1.
CHANGES IN R VERSION 2.14.0 patched:
[...]
o tempfile() on a Unix-alike now takes the process ID into account.
This is needed with multicore (and as part of parallel) because
the parent and all the children share a session temporary
directory, and they can share the C random number stream used to
produce the unique part. Further, two children can call
tempfile() simultaneously.
Comments (4)
I believe multicore spins off a separate process for each subtask. If that assumption is correct, then you should be able to use Sys.getpid() to "seed" tempfile:
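A minimal sketch of this idea, assuming a Unix-alike system and using `parallel::mclapply` (the `parallel` package that ships with R includes `mclapply`) in place of the original `multicore` package. The `"file"` prefix and `mc.cores = 2` are illustrative choices, not part of the original answer:

```r
library(parallel)  # assumption: stands in for library(multicore)

# Each forked child has its own process ID, so embedding it in the
# pattern keeps names produced by different children apart.
paths <- mclapply(1:4, function(x) {
  tempfile(pattern = paste0("file", Sys.getpid(), "_"))
}, mc.cores = 2)

print(unlist(paths))
```

Two tasks that happen to run in the same child share a PID, so this relies on successive `tempfile()` calls within one process still differing in their random part.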
Use the x in your function:
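One way to read this suggestion, sketched under the assumption that each task receives a distinct index `x`: build that index into the `pattern` argument of `tempfile`. The `"job"` prefix is a hypothetical choice, and `parallel::mclapply` is used here in place of `multicore::mclapply`:

```r
library(parallel)  # assumption: stands in for library(multicore)

paths <- mclapply(1:4, function(x) {
  # The task index makes the prefix unique per task, whatever random
  # suffix tempfile() appends afterwards.
  tempfile(pattern = paste0("job", x, "_"))
}, mc.cores = 2)

print(unlist(paths))
```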
Because the parallel jobs all run at the same time, and because the random seed comes from the system time, running four instances of tempfile in parallel will typically produce the same results (if you have 4 cores, that is. If you only have two cores, you'll get two pairs of identical temp file names).
Better to generate the tempfile names first and give them to your function as an argument:
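A small sketch of that approach, assuming a Unix-alike system and `parallel::mclapply` in place of `multicore::mclapply`. The worker `write_square` is a hypothetical example, not something from the original answer:

```r
library(parallel)  # assumption: stands in for library(multicore)

# Generate all names up front in the parent process, where successive
# tempfile() calls draw different random parts.
paths <- replicate(4, tempfile())

# Hypothetical worker: writes its result to the path it was handed.
write_square <- function(i, path) {
  writeLines(as.character(i^2), path)
  path
}

written <- mclapply(seq_along(paths),
                    function(i) write_square(i, paths[i]),
                    mc.cores = 2)
```

Because the children never call `tempfile()` themselves, it does not matter whether they share the parent's random number state.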
If you're using someone else's function that has a tempfile call in it, then working the PID into the tempfile name by modifying the tempfile function, as previously suggested, is probably the simplest plan:
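A rough sketch of such a modification, assuming it is enough to mask `tempfile` in the global environment. This only affects callers that resolve `tempfile` through the global environment (for example, functions defined or sourced in your own script); functions inside package namespaces still find `base::tempfile` first. `third_party` is a hypothetical stand-in for someone else's function:

```r
# Mask tempfile() with a wrapper that adds the process ID to the prefix.
tempfile <- function(pattern = "file", tmpdir = tempdir(), fileext = "") {
  base::tempfile(pattern = paste0(pattern, Sys.getpid(), "_"),
                 tmpdir = tmpdir, fileext = fileext)
}

# Hypothetical stand-in for a function you cannot edit.
third_party <- function() tempfile()

print(third_party())
```

Calling `rm(tempfile)` removes the mask and restores the base behaviour.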
At least for now, I chose to monkey-patch my way around this by using the following code in my .Rprofile, following Daniel's advice to use PID values. Obviously it's not a good option for any package you'd distribute, but for a single user's need it's the best option thus far, since it works in all cases.