Saving and loading all environments in R

Posted on 2024-12-04 13:54:05

I am developing a package for distributed computing in R (rmr, under the RHadoop project on GitHub). I am trying to make things as transparent as possible to the user and simply have the computation continue in another interpreter on another machine as if it were on the same machine. Something like

lapply(my.list, my.function)

where each call to my.function can in principle happen on a different node in a cluster, and hence in a separate interpreter. I am using the save and load pair with some success, but I would like a solution that works under all possible circumstances, not just in a large set of use cases.

No matter what my.function does, where it is defined, or what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages. save and load respectively save a list of objects from a specific environment and load a file into one. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each element of my.list has the same semantics locally and remotely.

Has this been done before? Are there any packages I should check out, or any other suggestions? I think this is the single hardest technical issue in rmr, and you would be contributing your solution to an OSS project.

Comments (2)

意犹 2024-12-11 13:54:05

Typically save and load should work just as you want: when a function is saved (actually, it is a "closure" that gets saved), the environment where it was defined is saved too. If the function was defined as part of a package, a reference to that package is saved instead, and the package is loaded again when load sees the reference. (You get a warning when saving if the package does not have a namespace.)
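
A minimal sketch of both cases, assuming a plain local session. The names make_adder and add5 are hypothetical and chosen only for illustration; the second half uses serialize and unserialize rather than save and load, simply to avoid a second temporary file.

```r
# Hypothetical closure factory: the returned function encloses `n`.
make_adder <- function(n) {
  force(n)                      # evaluate the argument now
  function(x) x + n
}
add5 <- make_adder(5)

# save() writes the closure together with its defining environment.
path <- tempfile(fileext = ".RData")
save(add5, file = path)

# load() into a separate environment to show that `n` travelled along.
restored <- new.env()
load(path, envir = restored)
closure_result <- restored$add5(1)

# A package function keeps only a reference to its namespace, which is
# looked up again on the receiving side.
sd_copy <- unserialize(serialize(stats::sd, NULL))
same_namespace <- identical(environment(sd_copy), asNamespace("stats"))
```

Here closure_result is computed from the enclosed n without n ever being saved explicitly, and same_namespace checks that the restored copy of sd points at the very same stats namespace.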

The only problem should be the global environment. For it, too, only a reference is saved, so the variables in the global environment are not saved along with the function and you have to save them explicitly.
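
One possible way to save those variables explicitly is sketched below. It is an illustration under stated assumptions, not a complete solution: scale_by, factor_k and payload are hypothetical names, and all.vars on the function body is only a crude guess at what the function needs (it does not follow functions that the body calls).

```r
# Hypothetical setup: a function living in the global environment that
# depends on a global variable.
assign("factor_k", 3, envir = globalenv())
scale_by <- function(x) x * factor_k
environment(scale_by) <- globalenv()   # make the enclosure explicit

# Crude dependency guess: symbols used as variables in the body that
# currently exist in the global environment.
candidates <- all.vars(body(scale_by))
is_global <- vapply(candidates, exists, logical(1),
                    envir = globalenv(), inherits = FALSE)
needed <- candidates[is_global]

# Ship the function together with the values it needs.
payload <- serialize(
  list(fun = scale_by, globals = mget(needed, envir = globalenv())),
  NULL
)

# Simulate a fresh remote session by dropping the variable locally.
rm("factor_k", envir = globalenv())

# Receiving side: restore the globals first, then call the function.
received <- unserialize(payload)
invisible(list2env(received$globals, envir = globalenv()))
result <- received$fun(2)
```

Without the list2env step the call would fail, because the unserialized function only carries a reference to the global environment, not its contents.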

Other environments are saved including their contents, and their parent environments are then saved recursively (unless a parent is a package or the global environment, as described above).

Note that the alternatives saveRDS and serialize provide a little more control: you can supply a refhook function that is called whenever an environment is saved. In it you can store the environment however you like and return a string id. When loading, a matching refhook is called to recreate the environment from that string id. However, the hook is still not called for the global environment.

e <- new.env() # parent is global env
e$foo <- 42
ee <- new.env(parent=e)
ee$bar <- 13
f <- local(function() foo+bar, ee) 
f() # foo+bar = 55
b <- serialize(f, NULL) # Gives you the serialized bytes

g <- unserialize(b) # Loads from the bytes
g() # 55
# unserialize created new environments, so this is TRUE
!identical(environment(g), environment(f))
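
The refhook mechanism described above can be sketched as follows. This is a minimal illustration: registry, save_hook, load_hook and the id "shared-env" are hypothetical, and the in-memory registry merely stands in for whatever storage both sides could actually reach.

```r
# Hypothetical stand-in for storage shared by sender and receiver.
registry <- new.env()

shared <- new.env()
shared$foo <- 42
f <- local(function() foo + 1, shared)   # closure enclosed by `shared`

# Called for each environment about to be written. Returning a string
# stores that id instead of the environment; NULL means "serialize as usual".
save_hook <- function(obj) {
  if (is.environment(obj) && identical(obj, shared)) {
    assign("shared-env", obj, envir = registry)
    return("shared-env")
  }
  NULL
}

# Called with the stored id when reading; must return the environment.
load_hook <- function(id) get(id, envir = registry)

bytes <- serialize(f, NULL, refhook = save_hook)
g <- unserialize(bytes, refhook = load_hook)

same_env <- identical(environment(g), shared)
value <- g()
```

Unlike the plain round trip, the restored closure here is attached to the environment returned by the hook rather than to a fresh copy.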

Hope this helps a bit.

Good luck with rmr!

暗喜 2024-12-11 13:54:05

After thinking about this question a bit further, it seems that the answers may be useful to your problem. If you are having some of the same problems in saving environments as the OP, then Gabor's answer is probably going to help you get on track. However, if basic serialization and saving of environments is the problem, my (admittedly less sophisticated) answer might help: convert to lists via as.list() and then serialize that in the usual way, or consider serialization via JSON; my favorite package for that is RJSONIO.
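
The as.list() route could look roughly like this. It is a sketch with hypothetical names (env, rebuilt), and it copies only the variables: the parent chain of the original environment is not carried over.

```r
env <- new.env()
env$foo <- 42
env$bar <- "text"

# Flatten the environment to a named list and serialize that.
# all.names = TRUE also keeps variables whose names start with a dot.
bytes <- serialize(as.list(env, all.names = TRUE), NULL)

# Receiving side: rebuild a fresh environment from the list.
rebuilt <- list2env(unserialize(bytes), envir = new.env())
rebuilt_names <- ls(rebuilt)
```

The same named list is also the natural input if the values are to be written as JSON instead, provided they are simple enough to survive that format.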

Tommy's answer, however, is much more informative about what's going on. Assuming you will be investigating these issues extensively, especially their serialization, I also recommend looking at Tommy's other excellent insights in this answer to a question on environments, closures, and frames.
