How much data can R handle?

Posted on 2024-10-28 18:52:27

Comments (4)

腹黑女流氓 2024-11-04 18:52:28

Perhaps a good indication of R's suitability for "big data" is that it has emerged as the platform of choice for developers competing in Kaggle.com data modeling competitions. See the article on the Revolution Analytics website: R beats out SAS and SPSS by a healthy margin. What R lacks in out-of-the-box number-crunching power it apparently makes up for in flexibility.

In addition to what's available on the web, there are several new books on how to hot-rod R for tackling big data. The Art of R Programming (Matloff 2011; No Starch Press) provides introductions to writing optimized R code, parallel computing, and using R in conjunction with C. The entire book is well written, with great code samples and walk-throughs. Parallel R (McCallum & Weston 2011; O'Reilly) looks good too.

白云悠悠 2024-11-04 18:52:28

Here is my short story with R and a big data set.
I had a connector from R to an RDBMS in which I stored 80 million compounds.

I built queries that gathered a subset of this data, and then manipulated that subset.
R was simply choking with more than 200k rows in memory on my PC:

  • Core Duo
  • 4 GB RAM

So working on a subset sized appropriately for the machine is a good approach.
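
The subset-at-a-time idea can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: a temporary CSV file stands in for the database query result, and the column names, row count, and chunk size are made up. With a real RDBMS connection the same loop shape applies to fetching a fixed number of rows per call (for example `DBI::dbFetch(res, n = ...)`).

```r
# Build a small CSV in a temporary file to stand in for a large query result.
# (Hypothetical data: compound ids and made-up weights.)
path <- tempfile(fileext = ".csv")
n_total <- 1000
write.csv(
  data.frame(id = seq_len(n_total), weight = seq_len(n_total) * 0.5),
  path, row.names = FALSE
)

chunk_size <- 300  # rows held in memory at once; tune to the machine
con <- file(path, open = "r")

# Read the header line once and strip the quotes that write.csv adds.
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names, fixed = TRUE)

rows_seen <- 0
weight_sum <- 0
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL  # read.csv signals an error once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Only running totals are kept; each chunk is discarded after use.
  rows_seen <- rows_seen + nrow(chunk)
  weight_sum <- weight_sum + sum(chunk$weight)
}
close(con)
unlink(path)

mean_weight <- weight_sum / rows_seen
cat(sprintf("processed %d rows, mean weight %.2f\n", rows_seen, mean_weight))
```

Only one chunk plus two running totals are in memory at any time, so the peak footprint depends on `chunk_size` rather than on the size of the full data set.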

葬花如无物 2024-11-04 18:52:27

If you look at the High-Performance Computing Task View on CRAN, you will get a good idea of what R can do in terms of high performance.

转角预定愛 2024-11-04 18:52:27

In principle, you can store as much data as you have RAM, with the exception that, currently, vectors and matrices are restricted to 2^31 - 1 elements because R uses 32-bit indexes on vectors. General vectors (lists, and their derivative, data frames) are restricted to 2^31 - 1 components, and each of those components has the same restrictions as vectors, matrices, lists, data frames, etc.
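
To put that limit in concrete terms, here is a small back-of-the-envelope sketch. It assumes 8 bytes per double and uses `.Machine$integer.max` as the index limit; the variable names are just for illustration. Note that later R versions (3.0.0 onward, on 64-bit builds) added long vectors that lift this limit for atomic vectors, while each dimension of a matrix stays limited to 2^31 - 1.

```r
# Largest value of R's 32-bit integer type, which is the classic index limit.
max_len <- .Machine$integer.max
stopifnot(max_len == 2^31 - 1)

# A double takes 8 bytes, so a numeric vector at that limit needs about 16 GiB
# for its data alone, before any copies are made.
bytes_per_double <- 8
gib_at_limit <- max_len * bytes_per_double / 2^30

# Measure a smaller vector to see the per-element cost in practice.
x <- numeric(1e6)
mib_measured <- as.numeric(object.size(x)) / 2^20

cat(sprintf("limit: %d elements, about %.1f GiB as doubles\n", max_len, gib_at_limit))
cat(sprintf("1e6 doubles occupy about %.2f MiB\n", mib_measured))
```

On a machine with a few GB of RAM, memory runs out long before the element limit is reached, which is why the limit is mostly theoretical for desktop use.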

Of course, these are theoretical limits. If you want to do anything with data in R, it will inevitably require space to hold at least a couple of copies, as R will usually copy data passed in to functions, etc.
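
A minimal sketch of why copies matter: modifying an argument inside a function leaves the caller's object untouched, so both versions exist in memory at once. The helper names (`rescale`, `working_set_bytes`) and the factor of 3 copies are assumptions for illustration, not measured values.

```r
# Hypothetical helper: modifies its argument and returns the result.
rescale <- function(v) {
  v[1] <- v[1] * 100  # the write triggers a copy; the caller's vector is untouched
  v
}

x <- c(1, 2, 3)
y <- rescale(x)

# x is unchanged, so both x and y now exist in memory side by side.
stopifnot(identical(x, c(1, 2, 3)), identical(y, c(100, 2, 3)))

# Rough budget: data size times the number of copies expected to be alive at once.
# The default of 3 copies is an assumption for illustration only.
working_set_bytes <- function(n_elements, bytes_each = 8, copies = 3) {
  n_elements * bytes_each * copies
}

gib_needed <- working_set_bytes(200e6) / 2^30
cat(sprintf("200 million doubles with 3 copies: about %.2f GiB\n", gib_needed))
```

A budget like this is only a rough guide, but it shows how a data set that fits in RAM on paper can still exhaust memory once a few intermediate copies are alive.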

There are efforts to allow on-disk storage (rather than in RAM), but even those are subject to the 2^31 - 1 limit mentioned above for the data in use in R at any one time. See the Large memory and out-of-memory data section of the High-Performance Computing Task View linked to in @Roman's post.
