Best way to allocate a matrix in R, NULL vs NA?

Posted 2024-08-11 12:59:36

I am writing R code to create a square matrix. So my approach is:

  1. Allocate a matrix of the correct size
  2. Loop through each element of my matrix and fill it with an appropriate value

My question is really simple: what is the best way to pre-allocate this matrix? Thus far, I have two ways:

> x <- matrix(data=NA,nrow=3,ncol=3)
> x
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
[3,]   NA   NA   NA

or

> x <- list()
> length(x) <- 3^2
> dim(x) <- c(3,3)
> x
     [,1] [,2] [,3]
[1,] NULL NULL NULL
[2,] NULL NULL NULL
[3,] NULL NULL NULL

As far as I can see, the former is a more concise method than the latter. Also, the former fills the matrix with NAs, whereas the latter is filled with NULLs.
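The two allocations differ in more than their fill value: the first yields an atomic logical matrix, while the second yields a list that merely carries a dim attribute. A minimal sketch of that distinction, where the names `a` and `b` are chosen here purely for illustration:

```r
# Approach 1: an atomic matrix whose cells are logical NA
a <- matrix(data = NA, nrow = 3, ncol = 3)

# Approach 2: a list of length 9 with a dim attribute (a "list-matrix")
b <- list()
length(b) <- 3^2
dim(b) <- c(3, 3)

typeof(a)     # "logical"
typeof(b)     # "list"
is.matrix(b)  # TRUE, because b has a dim attribute of length 2
is.atomic(a)  # TRUE
is.atomic(b)  # FALSE, each cell is a separate list element
```

Because `b` is not atomic, each cell is a separate list element rather than a plain number, which matters if the matrix is meant to hold numeric results.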

Which is the "better" way to do this? In this case, I'm defining "better" as "better performance", because this is statistical computing and this operation will be taking place with large datasets.

While the former is more concise, it isn't breathtakingly easier to understand, so I feel like this could go either way.

Also, what is the difference between NA and NULL in R? ?NA and ?NULL tell me that "NA" has a length of "1" whereas NULL has a length of "0" - but is there more here? Or a best practice? This will affect which method I use to create my matrix.

Comments (3)

ぃ弥猫深巷。 2024-08-18 12:59:36

When in doubt, test it yourself. The first approach is both easier and faster.

> create.matrix <- function(size) {
+ x <- matrix()
+ length(x) <- size^2
+ dim(x) <- c(size,size)
+ x
+ }
> 
> system.time(x <- matrix(data=NA,nrow=10000,ncol=10000))
   user  system elapsed 
   4.59    0.23    4.84 
> system.time(y <- create.matrix(size=10000))
   user  system elapsed 
   0.59    0.97   15.81 
> identical(x,y)
[1] TRUE

Regarding the difference between NA and NULL:

There are actually four special constants. Quoting the R Language Definition:

In addition, there are four special constants, NULL, NA, Inf, and NaN.

NULL is used to indicate the empty object. NA is used for absent (“Not Available”) data values. Inf denotes infinity and NaN is not-a-number in the IEEE floating point calculus (results of the operations respectively 1/0 and 0/0, for instance).

You can read more in the R manual on language definition.
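The practical difference between these constants is easiest to see at the console. A short sketch using only base R; the vectors `v` and `w` are illustrative names, not part of the original answer:

```r
length(NA)    # 1: NA is a value that occupies a slot
length(NULL)  # 0: NULL is the empty object

v <- c(1, NA, 3)    # NA keeps its position in the vector
w <- c(1, NULL, 3)  # NULL disappears when combined
length(v)  # 3
length(w)  # 2

is.na(NA)      # TRUE
is.null(NULL)  # TRUE
is.na(NaN)     # TRUE: NaN also counts as missing for is.na()
is.nan(NA)     # FALSE: NA is not a NaN

1 / 0  # Inf
0 / 0  # NaN
```

Since NA occupies a slot and NULL does not, NA is the natural placeholder for a cell that will be filled in later.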

千纸鹤带着心事 2024-08-18 12:59:36

According to this article we can do better than preallocating with NA by preallocating with NA_real_. From the article:

as soon as you assign a numeric value to any of the cells in 'x', the matrix will first have to be coerced to numeric when a new value is assigned. The originally allocated logical matrix was allocated in vain and just adds an unnecessary memory footprint and extra work for the garbage collector.
Instead allocate it using NA_real_ (or NA_integer_ for integers)

As recommended: let's test it.

testfloat = function(mat){
  n=nrow(mat)
  for(i in 1:n){
    mat[i,] = 1.2
  }
}

> system.time(testfloat(matrix(data=NA,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   3.08    0.24    3.32 
> system.time(testfloat(matrix(data=NA_real_,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.91    0.23    3.14 

And for integers:

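testint = function(mat){
  n=nrow(mat)
  for(i in 1:n){
    # Note: the literal 3 is a double, so this assignment coerces an
    # integer matrix to double; 3L would keep it integer.
    mat[i,] = 3
  }
}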

> system.time(testint(matrix(data=NA,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.96    0.29    3.31 
> system.time(testint(matrix(data=NA_integer_,nrow=1e4,ncol=1e4)))
   user  system elapsed 
   2.92    0.35    3.28 

The difference is small in my test cases, but it's there.
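The coercion described in the quoted article can be observed directly with `typeof()`, without timing anything. A small sketch on 2x2 matrices; the names `x`, `y`, and `z` are illustrative:

```r
# Plain NA is logical, so the first numeric assignment forces a coercion
x <- matrix(NA, nrow = 2, ncol = 2)
type_before <- typeof(x)  # "logical"
x[1, 1] <- 1.2
type_after <- typeof(x)   # "double": the whole matrix was converted

# NA_real_ starts as double, so no conversion is needed
y <- matrix(NA_real_, nrow = 2, ncol = 2)
y[1, 1] <- 1.2
typeof(y)  # "double"

# NA_integer_ stays integer only if integer values are assigned
z <- matrix(NA_integer_, nrow = 2, ncol = 2)
z[1, 1] <- 3L
type_z_int <- typeof(z)  # "integer"
z[1, 2] <- 3             # 3 is a double literal
type_z_dbl <- typeof(z)  # "double"
```

The last two steps suggest why an integer benchmark that assigns `3` rather than `3L` may show little benefit from `NA_integer_`.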

没︽人懂的悲伤 2024-08-18 12:59:36

rows <- 3
cols <- 3
x <- rep(NA, rows * cols)
x1 <- matrix(x, nrow = rows, ncol = cols)