Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics, and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler Oehlschlägel and colleagues' excellent slide presentations on the ff website.
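To give a flavor of what these look like in practice, here is a minimal sketch (file names and column types are placeholders) of a file-backed matrix with bigmemory and a disk-backed data frame with ff:

    library(bigmemory)
    library(ff)

    # bigmemory: read a large all-numeric CSV into a file-backed big.matrix;
    # the backing and descriptor files live on disk, so the data stays out of RAM
    x <- read.big.matrix("big_numeric.csv", header = TRUE, type = "double",
                         backingfile = "big_numeric.bin",
                         descriptorfile = "big_numeric.desc")
    summary(x[, 1])   # individual columns can be pulled into memory as needed

    # ff: read a large CSV chunk-wise into a disk-backed ffdf data frame
    y <- read.csv.ffdf(file = "big_mixed.csv", header = TRUE,
                       next.rows = 100000)   # rows read per chunk
    dim(y)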
Also, consider storing the data in a database and reading it in smaller batches for analysis. There are any number of approaches you could take. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.
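As a rough illustration of that workflow (the SQLite file, table, and column names here are invented), you can stream rows out of a database in batches and update a biglm fit chunk by chunk:

    library(DBI)
    library(RSQLite)
    library(biglm)

    con <- dbConnect(SQLite(), "mydata.sqlite")        # hypothetical database
    res <- dbSendQuery(con, "SELECT y, x1, x2 FROM big_table")

    chunk <- dbFetch(res, n = 10000)                   # first batch of rows
    fit <- biglm(y ~ x1 + x2, data = chunk)            # initial fit

    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)                 # next batch
      if (nrow(chunk) > 0) fit <- update(fit, chunk)   # incremental update
    }

    dbClearResult(res)
    dbDisconnect(con)
    summary(fit)

The key point is that biglm keeps only the model's sufficient statistics in memory, so the fit never sees more than one chunk of raw data at a time.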
And do investigate the other packages on the high-performance computing task view and those mentioned in the other answers. The packages above are simply the ones I happen to have more experience with.
I think the amount of data you can process is limited more by one's programming skills than by anything else. Although a lot of standard functionality is focused on in-memory analysis, cutting your data into chunks already helps a lot. Of course, this takes more time to program than picking up standard R code, but often it is quite possible.
Cutting up data can, for example, be done using read.table or readBin, which support reading only a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages which deliver out-of-memory functionality out of the box. You could also put your data in a database. For spatial raster data, the excellent raster package provides out-of-memory analysis.
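As a small sketch of the chunked read.table approach (the file name and the processing step are placeholders for whatever your analysis actually does):

    con <- file("huge_file.csv", open = "r")
    hdr <- strsplit(readLines(con, n = 1), ",")[[1]]   # read the header line once

    repeat {
      chunk <- tryCatch(
        read.table(con, nrows = 100000, sep = ",",
                   col.names = hdr, stringsAsFactors = FALSE),
        error = function(e) NULL)                      # no lines left -> NULL
      if (is.null(chunk) || nrow(chunk) == 0) break
      # process_chunk(chunk)   # placeholder: aggregate, fit, write results, ...
    }
    close(con)

Because the connection stays open between calls, each read.table call picks up where the previous chunk ended.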
For machine learning tasks I can recommend the biglm package, which is used to do "Regression for data too large to fit in memory". For using R with really big data, you can use Hadoop as a backend and then use the rmr package to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.
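The rmr side of that looks roughly like the sketch below; this assumes a working Hadoop installation, the rmr2 package, and an HDFS path and CSV layout that are purely illustrative:

    library(rmr2)

    # Per-group means computed on the cluster via MapReduce
    result <- mapreduce(
      input        = "/user/me/mydata",                      # HDFS path (placeholder)
      input.format = make.input.format("csv", sep = ","),
      map          = function(k, v) keyval(v[[1]], v[[2]]),  # key = group, value = measurement
      reduce       = function(k, vv) keyval(k, mean(vv))     # one mean per group
    )
    from.dfs(result)   # pull the (small) result back into the local R session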
It all depends on the algorithms you need. If they can be translated into an incremental form (where only a small part of the data is needed at any given moment, e.g. for Naive Bayes you can hold in memory only the model itself and the observation currently being processed), then the best suggestion is to perform the machine learning incrementally, reading new batches of data from disk.
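For example, a Gaussian Naive Bayes model only needs per-class counts, sums, and sums of squares, all of which can be accumulated one chunk at a time; a rough sketch (how chunks are read and the label column name are placeholders):

    # Running sufficient statistics for Gaussian Naive Bayes:
    # per class we only need n, sum(x) and sum(x^2) for each feature.
    update_stats <- function(stats, chunk, label_col = "y") {
      feats <- setdiff(names(chunk), label_col)
      for (lab in unique(chunk[[label_col]])) {
        rows <- chunk[chunk[[label_col]] == lab, feats, drop = FALSE]
        s <- stats[[as.character(lab)]]
        if (is.null(s)) s <- list(n = 0, sum = 0, sumsq = 0)
        s$n     <- s$n + nrow(rows)
        s$sum   <- s$sum + colSums(rows)
        s$sumsq <- s$sumsq + colSums(rows^2)
        stats[[as.character(lab)]] <- s
      }
      stats
    }

    stats <- list()
    # for (chunk in chunks_read_from_disk) stats <- update_stats(stats, chunk)
    # Class-conditional means and variances then follow directly from n, sum, sumsq.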
However, many algorithms, and especially their implementations, really require the whole dataset. If the size of the dataset fits your disk (and file system limitations), you can use the mmap package, which allows you to map a file on disk to memory and use it in your program. Note, however, that reads and writes to disk are expensive, and R sometimes likes to move data back and forth frequently, so be careful.
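A minimal mmap sketch (the file name and the assumption that it contains raw 8-byte doubles are mine):

    library(mmap)

    # Map a binary file of doubles into the session without loading it;
    # values are read from disk only when they are indexed.
    m <- mmap("values.bin", mode = real64())

    head(m[1:10])     # pulls just these values off disk
    mean(m[1:1000])   # operate on a slice at a time, not the whole file

    munmap(m)         # release the mapping when done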
If your data can't be stored even on your hard drive, you will need to use a distributed machine learning system. One such R-based system is Revolution R, which is designed to handle really large datasets. Unfortunately, it is not open source and costs quite a lot of money, but you may be able to get a free academic license. As an alternative, you may be interested in the Java-based Apache Mahout: a less elegant but very efficient solution based on Hadoop that includes many important algorithms.
If memory is not sufficient, one solution is to push the data to disk and use distributed computing. I think RHadoop (R + Hadoop) may be one of the solutions for tackling a large dataset.