Plotting very large data sets in R

Posted 2024-10-05 22:46:06

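How can I plot a very large data set in R?

I'd like to use a boxplot, a violin plot, or something similar. The data does not all fit in memory. Can I read it in incrementally and calculate the summaries needed to make these plots? If so, how?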


攀登最高峰 2024-10-12 22:46:06

As a supplement to my comment on Dmitri's answer, here is a function that calculates quantiles using the ff big-data handling package:

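library(ff)

# Exact quantiles of an ff vector: sort on disk, then read only the needed elements.
ffquantile <- function(ffv, qs = c(0, 0.25, 0.5, 0.75, 1), ...) {
  stopifnot(all(qs >= 0 & qs <= 1))
  ffvs <- ffsort(ffv, ...)                 # sorted copy, still file-backed
  j  <- qs * (length(ffv) - 1) + 1         # fractional 1-based positions
  jf <- floor(j)
  jc <- ceiling(j)
  # midpoint of the two neighbouring order statistics for each quantile
  rowSums(matrix(ffvs[c(jf, jc)], length(qs), 2)) / 2
}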

This is an exact algorithm, so it relies on sorting and may therefore take a lot of time.
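The index arithmetic can be checked on an ordinary in-memory vector before pointing the function at an ff vector. The sketch below mirrors the same steps with base R's `sort()`; `mem_quantile` is a name made up for this illustration and is not part of ff.

```r
# Same logic as ffquantile, but for a plain numeric vector (illustration only).
mem_quantile <- function(v, qs = c(0, 0.25, 0.5, 0.75, 1)) {
  stopifnot(all(qs >= 0 & qs <= 1))
  vs <- sort(v)
  j  <- qs * (length(vs) - 1) + 1        # fractional 1-based positions
  (vs[floor(j)] + vs[ceiling(j)]) / 2    # midpoint of the two neighbours
}

x <- 1:101
res <- mem_quantile(x)
print(res)
print(quantile(x))   # base R reference for comparison
```

When a position is not an integer, this approach returns the midpoint of the two neighbouring order statistics, which can differ slightly from the linear interpolation that `quantile()` uses by default.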

梅窗月明清似水 2024-10-12 22:46:06

The problem is that you can't load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Progressive calculation of quantiles is impossible, but sampling should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.

If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame stored in text format when it gets too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read from a server, a database, or whatever. Just make sure you open the connection in the correct mode.

Take a simple .csv file; the following function then samples a fraction p of the data:

sample.df <- function(f, n = 10000, split = ",", p = 0.1) {
    con <- file(f, open = "rt")
    on.exit(close(con))
    y <- data.frame()
    # read the header, skipping any leading empty lines
    x <- character(0)
    while (length(x) == 0) {
      x <- strsplit(readLines(con, n = 1), split)[[1]]
    }
    Names <- x
    # read the data n lines at a time and keep a random fraction p of each chunk
    repeat {
      # read.table() raises an error once the connection is exhausted
      x <- tryCatch(read.table(con, nrows = n, sep = split), error = function(e) NULL)
      if (is.null(x)) break
      names(x) <- Names
      nn <- nrow(x)
      id <- sample(seq_len(nn), round(nn * p))
      y <- rbind(y, x[id, , drop = FALSE])
    }
    rownames(y) <- NULL
    return(y)
}

An example of the usage:

#Make a file
Df <- data.frame(
  X1=1:10000,
  X2=1:10000,
  X3=rep(letters[1:10],1000)
)
write.csv(Df, file = "test.txt", row.names = FALSE, quote = FALSE)

# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)

#clean up
unlink("test.txt")

心如狂蝶 2024-10-12 22:46:06

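All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), all of which are easy to precompute. Take a look at the boxplot.stats function.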
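As a minimal sketch of that idea, the summaries from `boxplot.stats()` can be collected one group at a time and handed to `bxp()`, which draws a boxplot from precomputed statistics. The simulated `rnorm()` draws below are only a stand-in for loading one group of the real data at a time; the group names are made up.

```r
set.seed(42)
groups <- c("a", "b", "c")

# Summarise one group at a time; only the summaries are kept in memory.
summaries <- lapply(groups, function(g) {
  x <- rnorm(1e5, mean = match(g, groups))   # stand-in for "load group g"
  boxplot.stats(x)
})

outs <- lapply(summaries, `[[`, "out")
bx <- list(
  stats = sapply(summaries, `[[`, "stats"),  # 5 x n_groups matrix
  n     = sapply(summaries, `[[`, "n"),
  conf  = sapply(summaries, `[[`, "conf"),
  out   = unlist(outs),
  group = rep(seq_along(outs), lengths(outs)),
  names = groups
)
bxp(bx)   # draws the boxplot from the precomputed summaries
```

This still requires each group to fit in memory on its own; it only avoids holding all groups at once.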

稀香 2024-10-12 22:46:06

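You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets it can be useful to store the data in a database and pull only pieces into R. The database can also do the sorting for you, and computing quantiles on sorted data is much simpler (then just use the quantiles to draw the plots).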
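A rough sketch of the database-sorting idea, assuming the DBI and RSQLite packages are installed. The table name `obs`, the column name `value`, and the helper `db_quantile` are made up for this illustration, and an in-memory database stands in for a real on-disk one.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # a file path would be used for real data
set.seed(1)
x <- rnorm(10001)
dbWriteTable(con, "obs", data.frame(value = x))

n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM obs")$n

# Value at a given quantile: let the database sort, then fetch a single row.
db_quantile <- function(con, q, n) {
  offset <- as.integer(round(q * (n - 1)))        # 0-based row in sorted order
  dbGetQuery(con, sprintf(
    "SELECT value FROM obs ORDER BY value LIMIT 1 OFFSET %d", offset))$value
}

qs <- sapply(c(0.25, 0.5, 0.75), db_quantile, con = con, n = n)
dbDisconnect(con)
print(qs)
```

Only one row per quantile ever reaches R. The helper picks the nearest row rather than interpolating between neighbours, which is usually fine for plotting large data.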

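There is also the hexbin package (Bioconductor) for doing scatterplot equivalents with very large datasets (you probably still want to use a sample of the data, but it works with a large sample).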

相权↑美人 2024-10-12 22:46:06

You could put the data into a database and calculate the quantiles using SQL. See: http://forge.mysql.com/tools/tool.php?id=149

旧瑾黎汐 2024-10-12 22:46:06

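This is an interesting problem.

Boxplots require quantiles. Computing quantiles on very large datasets is tricky.

The simplest solution, which may or may not work in your case, is to downsample the data first and produce plots of the sample. In other words, read a batch of records at a time and retain a subset of them in memory (chosen either deterministically or randomly). At the end, produce plots based on the data that has been retained in memory. Again, whether this is viable depends very much on the properties of your data.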
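One way to do the random variant is reservoir sampling, which keeps a uniform random sample of fixed size no matter how many records stream past. This is a sketch of that swapped-in technique, not something the answer prescribes; `reservoir_update` is a hypothetical helper, and the simulated chunks stand in for successive reads from a file.

```r
# Keep a uniform random sample of k values from a stream read in chunks.
reservoir_update <- function(state, chunk, k) {
  for (v in chunk) {
    state$seen <- state$seen + 1
    if (state$seen <= k) {
      state$sample[state$seen] <- v          # fill the reservoir first
    } else {
      j <- sample.int(state$seen, 1)         # uniform in 1..seen
      if (j <= k) state$sample[j] <- v       # replace with probability k/seen
    }
  }
  state
}

set.seed(7)
k <- 1000
state <- list(sample = numeric(0), seen = 0)
for (i in 1:20) {                            # 20 chunks stand in for file reads
  chunk <- rnorm(5000, mean = 10)
  state <- reservoir_update(state, chunk, k)
}
boxplot(state$sample)
```

Memory use is bounded by `k` regardless of the stream length, so `k` can be set to whatever comfortably fits.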

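Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily available R implementations.

The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.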
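A very simple one-pass approximation, not taken from the paper, is a fixed-bin histogram: count observations per bin as chunks arrive, then read quantiles off the cumulative counts. The sketch below assumes the data range is known or can be safely bounded in advance (`lo` and `hi` are such assumed bounds), and uses simulated chunks in place of file reads.

```r
# Fixed-bin histogram as a one-pass, bounded-memory quantile approximation.
# Assumption: the data range [lo, hi] is known (or safely bounded) in advance.
lo <- -6; hi <- 6; nbins <- 2000
breaks <- seq(lo, hi, length.out = nbins + 1)
counts <- numeric(nbins)

set.seed(3)
for (i in 1:50) {                                   # 50 chunks stand in for file reads
  chunk <- rnorm(20000)
  chunk <- pmin(pmax(chunk, lo), hi)                # clamp stray values into the end bins
  bin <- findInterval(chunk, breaks, rightmost.closed = TRUE)
  counts <- counts + tabulate(bin, nbins)
}

approx_quantile <- function(q) {
  cum <- cumsum(counts) / sum(counts)
  i <- which(cum >= q)[1]                           # first bin reaching the target mass
  (breaks[i] + breaks[i + 1]) / 2                   # report the bin midpoint
}
qs <- sapply(c(0.25, 0.5, 0.75), approx_quantile)
print(qs)
```

The error is at most about one bin width, so accuracy is traded directly against the number of bins kept in memory.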

绳情 2024-10-12 22:46:06

You could make plots from a manageable sample of your data. For example, if you use only 10% of the rows, chosen at random, a boxplot of this sample shouldn't differ much from the all-data boxplot.

If your data are in a database, you should be able to create a random flag there (as far as I know, almost every database engine has some kind of random number generator).

The second question is how large your dataset is. For a boxplot you need two columns: the value variable and the group variable. This example:

N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)

needs 100MB of RAM. With N=1e7 it uses less than 1GB of RAM, which is still manageable on a modern machine.
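The claim about a 10% sample can be checked on simulated data. This is only a sanity check under the assumption of a well-behaved distribution (standard normal here), not a guarantee for arbitrary data.

```r
set.seed(123)
N <- 1e6
x <- rnorm(N)
idx <- sample.int(N, N * 0.1)          # 10% of the rows, chosen at random

probs <- c(0.25, 0.5, 0.75)
full <- quantile(x, probs)
samp <- quantile(x[idx], probs)
print(rbind(full, samp))
print(max(abs(full - samp)))           # largest discrepancy among the quartiles
```

The quartiles agree closely; the whisker ends and individual outliers are the parts of a boxplot that a sample reproduces least faithfully.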

若能看破又如何 2024-10-12 22:46:06

Maybe you could consider using disk.frame to summarise the data down before running the plotting?

绅士风度i 2024-10-12 22:46:06

The problem with R (and other languages like Python and Julia) is that you have to load all your data into memory to plot it. As of 2022, the best solution is to use DuckDB (there is an R connector): it allows you to query very large datasets (CSV, Parquet, among others), and it comes with many functions to compute summary statistics. The idea is to use DuckDB to compute those statistics, load them into R/Python/Julia, and plot.

Computing a boxplot with SQL + R

You need a bunch of statistics to plot a boxplot. If you want a complete reference, you can look at matplotlib's code. It is written in Python, but it is pretty straightforward, so you'll get it even if you don't know Python.

The most critical piece is the percentiles; you can compute those in DuckDB like this (just change the placeholders):

SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"

You need some other statistics to create the boxplot with all its details. For a full implementation, check this (note: it's written in Python). I had to implement this for a package I wrote called JupySQL, which allows plotting very large datasets in Jupyter by leveraging SQL engines such as DuckDB.

Once you compute the statistics, you can use R to generate the boxplot.
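A minimal end-to-end sketch of that workflow from R, assuming the DBI and duckdb packages are installed. The table `demo` and column `x` are made up for illustration, and a small simulated table stands in for a large Parquet or CSV file. The whisker ends are taken as the most extreme values within 1.5 IQR of the quartiles, which is the usual boxplot convention rather than something the answer specifies.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())                 # in-memory DuckDB instance
set.seed(1)
dbWriteTable(con, "demo", data.frame(x = rnorm(1e5)))

# Step 1: quartiles and count, computed inside DuckDB.
q <- dbGetQuery(con, "
  SELECT
    percentile_disc(0.25) WITHIN GROUP (ORDER BY x) AS q1,
    percentile_disc(0.50) WITHIN GROUP (ORDER BY x) AS med,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY x) AS q3,
    COUNT(*) AS n
  FROM demo")

# Step 2: whisker ends = most extreme values inside the 1.5 * IQR fences.
iqr <- q$q3 - q$q1
w <- dbGetQuery(con, sprintf(
  "SELECT MIN(x) AS lo, MAX(x) AS hi FROM demo WHERE x BETWEEN %.17g AND %.17g",
  q$q1 - 1.5 * iqr, q$q3 + 1.5 * iqr))
dbDisconnect(con, shutdown = TRUE)

# Step 3: draw from the five numbers only (outliers are omitted in this sketch).
bx <- list(stats = matrix(c(w$lo, q$q1, q$med, q$q3, w$hi), ncol = 1),
           n = as.numeric(q$n), out = numeric(0), group = numeric(0), names = "x")
bxp(bx)
```

Only two one-row results cross from DuckDB into R. Drawing individual outliers would need one more query for the rows outside the fences.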
