Saving and loading all environments in R
I am developing a package to perform distributed computing in R (rmr, under the RHadoop project on GitHub). I am trying to make things as transparent as possible to the user and simply have the computation continue in another interpreter on some other machine as if it were on the same machine. Something like
lapply(my.list, my.function)
where each call to my.function can in principle happen on a different node in a cluster, hence in a separate interpreter. I am using the pair save() and load() with a certain degree of success, but I would like a solution that works under all possible circumstances, not just in a large set of use cases.
No matter what my.function does, no matter where it is defined, no matter what other objects and packages it refers to, I would like to be sure that if it works locally, it also works remotely, including loading the necessary packages and everything. save() and load() respectively save a list of objects from a specific environment and load a file into one. I would like to find or write something that saves and loads all the necessary objects from and to the necessary environments, so that evaluating my.function on each of the elements of my.list will have the same semantics locally and remotely.
Has this been done before? Are there any packages I should check out, or any other suggestions? I think this is the single hardest technical issue in rmr, and you would be contributing your solution to an OSS project.
Comments (2)
Typically save() and load() should work just as you want: when a function is saved (actually, it is a "closure" that gets saved), the environment where it was defined is saved too. If the function was defined as part of a package, a reference to that package is saved instead, and the package is loaded again when load() sees the reference. (You get a warning when saving if the package does not have a namespace.)
The only problem should be the global environment. There, too, only a reference is saved, which does not save the variables in the global environment, so you have to save them explicitly.
Other environments are saved including their contents, and then the parent environment is saved recursively as well (unless it is a package or globalenv, as described above).
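The behaviour described above can be checked in a single session. The following is a minimal sketch, not rmr code: the names make_adder, scale_by_global and scale_factor are made up for illustration, and removing the variable from the global environment is only a stand-in for "a fresh interpreter on another node".

```r
# 1. A closure carries its enclosing environment (and contents) with it.
make_adder <- function(n) {
  force(n)                      # evaluate n now so the value is stored
  function(x) x + n
}
add5 <- make_adder(5)
add5_remote <- unserialize(serialize(add5, connection = NULL))
closure_result <- add5_remote(1)          # 6: n travelled with the closure

# 2. A package function is stored as a reference to its namespace.
sd_remote <- unserialize(serialize(stats::sd, connection = NULL))
sd_env_name <- environmentName(environment(sd_remote))   # "stats"

# 3. A function living in the global environment only keeps a reference
#    to globalenv(); the variables it uses are NOT saved with it.
scale_by_global <- function(x) x * scale_factor
environment(scale_by_global) <- globalenv()
assign("scale_factor", 3, envir = globalenv())
bytes <- serialize(scale_by_global, connection = NULL)

rm("scale_factor", envir = globalenv())   # simulate a fresh interpreter
remote_fn <- unserialize(bytes)
global_result <- tryCatch(
  remote_fn(2),
  error = function(e) conditionMessage(e)  # scale_factor cannot be found
)
```

In the third case the function itself arrives intact, but the call fails because scale_factor was never part of the serialized payload; such variables have to be shipped explicitly.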
Note that the saveRDS() and serialize() alternatives provide a little more control: you get to provide a refhook function that is called whenever an environment is saved. You then do whatever you want to store the environment and return a string id. When loading, a similar refhook is called to recreate the environment from that string id. However, the hook still does not get called for the global environment.
Hope this helps a bit.
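As a rough illustration of the refhook mechanism, here is a sketch using serialize() and unserialize() (saveRDS() and readRDS() accept a refhook argument as well). Everything named here is an assumption made for the example: registry is an in-memory stand-in for whatever external storage you would really use, the "env-N" id scheme is arbitrary, and re-parenting restored environments under globalenv() is just one possible choice.

```r
registry <- new.env()   # hypothetical store: id -> environment contents

save_hook <- function(obj) {
  # Only intercept plain environments; returning NULL lets R serialize
  # everything else (e.g. classed srcfile environments) the normal way.
  if (is.environment(obj) && is.null(attributes(obj))) {
    id <- paste0("env-", length(ls(registry)) + 1L)
    assign(id, as.list(obj, all.names = TRUE), envir = registry)
    return(id)                    # this string replaces the environment
  }
  NULL
}

load_hook <- function(id) {
  # Rebuild the environment from the stored contents (assumed parent: globalenv).
  list2env(get(id, envir = registry), parent = globalenv())
}

make_adder <- function(n) {
  force(n)
  function(x) x + n
}
add5 <- make_adder(5)

bytes    <- serialize(add5, connection = NULL, refhook = save_hook)
restored <- unserialize(bytes, refhook = load_hook)
restored_result <- restored(1)    # 6: n was recovered through the hooks
```

Because the hook is never invoked for the global environment, this approach does not by itself solve the problem of global variables; those still need to be collected and sent separately.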
Good luck with rmr!
After thinking about this question a bit further, it seems that the answers may be useful to your problem. If you are having some of the same problems saving environments as the OP, then Gabor's answer will probably help you get on track. However, if basic serialization and saving of environments is the problem, my (admittedly less sophisticated) answer might help: convert to lists via as.list() and then serialize that in the usual way, or consider serialization via JSON; my favorite package for that is RJSONIO.
Tommy's answer, however, is much more informative about what's going on. Assuming you will be investigating these issues extensively, especially their serialization, I also recommend looking at Tommy's other excellent insights in this answer to a question on environments, closures, and frames.
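The as.list() route mentioned above might look like the following sketch. The environment workspace and its variables are invented for the example, and only base R serialization is shown (the JSON variant is left out).

```r
# An environment holding the objects a remote call would need.
workspace <- new.env()
assign("threshold", 0.5, envir = workspace)
assign("labels", c("low", "high"), envir = workspace)

# Flatten to a named list and serialize that in the usual way.
contents <- as.list(workspace, all.names = TRUE)
bytes <- serialize(contents, connection = NULL)

# On the receiving side: rebuild an environment from the list.
rebuilt <- list2env(unserialize(bytes), envir = new.env())
rebuilt_names <- sort(ls(rebuilt))
```

This copies values only. The parent chain of the original environment is not preserved, and any functions stored in the list still carry their own enclosing environments, so the caveats from the other answer continue to apply.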