Parallel processing and temporary files

Posted 2024-10-21 20:50:27

I'm using the mclapply function in the multicore package to do parallel processing. It seems that all the child processes it starts get the same temporary file names from the tempfile function. For example, if I have four processors,

library(multicore)
mclapply(1:4, function(x) tempfile())

will give four identical file names. Obviously I need the temporary files to be different so that the child processes don't overwrite each other's files. When using tempfile indirectly, i.e. calling some function that calls tempfile, I have no control over the file name.

Is there a way around this? Do other parallel processing packages for R (e.g. foreach) have the same problem?

Update: This is no longer an issue since R 2.14.1.

CHANGES IN R VERSION 2.14.0 patched:

[...]

o tempfile() on a Unix-alike now takes the process ID into account.
  This is needed with multicore (and as part of parallel) because
  the parent and all the children share a session temporary
  directory, and they can share the C random number stream used to
  produce the unique part.  Further, two children can call
  tempfile() simultaneously.
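
As a quick check of the behaviour this NEWS entry describes, the sketch below asks forked workers for temporary file names and tests that they are all distinct. It assumes R 2.14.1 or later, where mclapply is provided by the parallel package. The worker count of 2 is an arbitrary choice, and it falls back to 1 on Windows, where forking is unavailable.

```r
library(parallel)

# Forking only exists on Unix-alikes; on Windows mclapply() requires mc.cores = 1.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task asks for a temporary file name from inside its worker process.
paths <- unlist(mclapply(1:4, function(i) tempfile(), mc.cores = n_workers))

# TRUE when no two tasks were handed the same name.
all_distinct <- anyDuplicated(paths) == 0L
print(all_distinct)
```

On an R version older than 2.14.1 the same check would be expected to fail, which is the problem described in the question.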

徒留西风 2024-10-28 20:50:27

I believe multicore spins off a separate process for each subtask. If that assumption is correct, then you should be able to use Sys.getpid() to "seed" tempfile:

tempfile(pattern=paste("foo", Sys.getpid(), sep=""))
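
A minimal sketch of how that suggestion might look inside mclapply, assuming the parallel package (which absorbed multicore). The helper name pid_tempfile and the "foo" prefix are illustrative only and not part of any package.

```r
library(parallel)

# Hypothetical helper: embed the calling process's PID in the file name prefix.
pid_tempfile <- function(prefix = "foo") {
  tempfile(pattern = paste(prefix, Sys.getpid(), "_", sep = ""))
}

# Forking only exists on Unix-alikes; on Windows mclapply() requires mc.cores = 1.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Each task writes to its own file and reports the path and the PID it ran under.
results <- mclapply(1:4, function(i) {
  path <- pid_tempfile()
  writeLines(paste("task", i), path)
  list(pid = Sys.getpid(), path = path)
}, mc.cores = n_workers)

paths <- vapply(results, function(r) r$path, character(1))
pids  <- vapply(results, function(r) r$pid, integer(1))
```

Two tasks that happen to run in the same worker share a PID, so the PID separates workers rather than tasks. tempfile still appends its own suffix after the pattern, and it skips names that already exist on disk.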
与之呼应 2024-10-28 20:50:27

Use the x in your function:

mclapply(1:4, function(x) tempfile(pattern=paste("file",x,"-",sep="")))
不寐倦长更 2024-10-28 20:50:27

Because the forked child processes all start from the same state of the random number stream that tempfile draws on (see the NEWS entry quoted in the question), running four instances of tempfile in parallel will typically produce the same results. That is with four cores; if you only have two cores, you'll get two pairs of identical temp file names.

Better to generate the tempfile names first and give them to your function as an argument:

filenames <- tempfile( rep("file",4) )
mclapply( filenames, function(x){})
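
To make the empty function body above concrete, here is a sketch in which the parent generates the names and each worker only writes to the path it is given. It assumes the parallel package; the task body (writing a single line) and the variable names are placeholders.

```r
library(parallel)

# Names are generated once in the parent, so workers never call tempfile() themselves.
filenames <- tempfile(rep("file", 4))

# Forking only exists on Unix-alikes; on Windows mclapply() requires mc.cores = 1.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker writes to the path it was handed; the line written is a placeholder payload.
invisible(mclapply(seq_along(filenames), function(i) {
  writeLines(paste("result of task", i), filenames[i])
}, mc.cores = n_workers))

# Back in the parent, read every file to collect the results.
contents <- vapply(filenames, readLines, character(1), USE.NAMES = FALSE)
```

Because the names exist before any child is forked, this approach does not depend on how tempfile behaves inside the workers.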

If you're using someone else's function that has a tempfile call in it, then working the PID into the tempfile name by modifying the tempfile function, as previously suggested, is probably the simplest plan:

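# Mask tempfile() in the global environment so each name carries the calling process's PID.
# base::tempfile() is called directly instead of going through .Internal().
tempfile <- function(pattern = "file", tmpdir = tempdir(), fileext = "") {
  base::tempfile(paste("pid", Sys.getpid(), pattern, sep = ""), tmpdir, fileext)
}
mclapply(1:4, function(x) tempfile())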
白芷 2024-10-28 20:50:27

At least for now, I chose to monkey-patch my way around this by using the following code in my .Rprofile following Daniel's advice to use PID values.

assignInNamespace("tempfile.orig", tempfile, ns="base")
.tempfile = function(pattern="file", tmpdir=tempdir())
    tempfile.orig(paste(pattern, Sys.getpid(), sep=""), tmpdir)
assignInNamespace("tempfile", .tempfile, ns="base")

Obviously it's not a good option for any package you'd distribute, but for a single user's need it's the best option thus far since it works in all cases.
