How do I pass large arrays between numpy and R?
I'm using Python and NumPy/SciPy to do regex matching and stemming for a text processing application, but I also want to use some of R's statistical packages.
What's the best way to pass the data from Python to R (and back)?
Also, I need to back up the array to disk at some point, so I'm open to saving from Python and loading in R if that's the best solution. The matrices are pretty big (e.g. 100,000 x 10,000), so using sparse matrices might also be nice.
Apologies if this is a repost. I haven't been able to find anything that puts all these pieces together.
Comments (3)
Have you already looked into RPy? It's a Python interface to R, so I'd guess it would spare you the data handling.
To back up your NumPy arrays you can use pickle. However, pickle seems to add a lot of overhead when saving very large data, so NumPy arrays of that size are better saved using the HDF standard. Here's an article covering that: http://www.shocksolution.com/2010/01/10/storing-large-numpy-arrays-on-disk-python-pickle-vs-hdf5adsf/
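For the sparse case mentioned in the question, a plain-text interchange file is another option besides pickle or HDF. The sketch below is not from the linked article. It writes (row, column, value) triplets in the Matrix Market coordinate format using only the standard library, then reads them back as a sanity check. The function names, the file name `counts.mtx`, and the sample triplets are all made up for illustration.

```python
import os
import tempfile


def write_matrix_market(path, n_rows, n_cols, entries):
    """Write (row, col, value) triplets as a Matrix Market coordinate file.

    `entries` uses 0-based indices (Python style); the file format is 1-based.
    """
    entries = list(entries)
    with open(path, "w", encoding="ascii") as fh:
        # Banner line: real-valued, general (non-symmetric) sparse matrix.
        fh.write("%%MatrixMarket matrix coordinate real general\n")
        # Size line: rows, columns, number of stored entries.
        fh.write(f"{n_rows} {n_cols} {len(entries)}\n")
        for row, col, value in entries:
            fh.write(f"{row + 1} {col + 1} {value!r}\n")


def read_matrix_market(path):
    """Read the file back into (n_rows, n_cols, {(row, col): value})."""
    with open(path, encoding="ascii") as fh:
        # Skip the banner and any comment lines, which start with '%'.
        lines = [ln for ln in fh if not ln.startswith("%")]
    n_rows, n_cols, nnz = (int(tok) for tok in lines[0].split())
    cells = {}
    for ln in lines[1:1 + nnz]:
        row, col, value = ln.split()
        cells[(int(row) - 1, int(col) - 1)] = float(value)
    return n_rows, n_cols, cells


# Hypothetical term counts: (document index, term index, count).
triplets = [(0, 2, 3.0), (1, 0, 1.0), (3, 4, 2.5)]
path = os.path.join(tempfile.mkdtemp(), "counts.mtx")
write_matrix_market(path, 4, 5, triplets)
shape_and_cells = read_matrix_market(path)
```

If the data is already a SciPy sparse matrix, `scipy.io.mmwrite` writes this same format, and on the R side `Matrix::readMM` should be able to load the file as a sparse matrix. A text format is much slower and bulkier than a binary one, so this probably suits occasional hand-offs better than routine backups.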
Use Rpy (http://rpy.sourceforge.net/) to call R from Python.
The caveat is that your R and Python versions must exactly match the ones the Rpy binary was built for, so be careful with the installation.
I cannot comment on sharing "large data" between R and Python, but I have had a much easier time working with pyRserve than with RPy or RPy2.
That being said, I am curious about the text processing you are doing. Python obviously has a lot to offer on the text processing side, but it also has a lot to offer statistically in packages like NLTK and the Pattern package from CLiPS. Are you just more comfortable doing stats in R, or is there something specific missing in Python?