Return the value from the column indicated in the same row

Posted on 2024-09-08 14:37:58

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.

Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns hold values, and the 31st column holds a column number. For each row, I need to retrieve the value in the column indicated by the 31st column.

Example row: [26,354,72,5987..,461,3] (this means that the value sought is the one in column 3, i.e. 72)

The too slow loop looks like this:

a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
   a[i] <- data[i,data[i,31]]
}

I would think this would work:

a <- data[,data[,31]]

... but it results in "Error: cannot allocate vector of size 2.8 Mb".

I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.

The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.

Your support is highly appreciated!

Chris

嘿哥们儿 2024-09-15 14:37:58

t(data[,1:30])[30*(0:399999)+data[,31]]

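This works because you can reference matrices both in array format and in vector format, where the elements are counted column-wise first (the full matrix is a 400000*31 long vector in this case; the transposed 30-column slice used here has 400000*30 elements). To count row-wise instead, you use the transpose.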
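
As a small illustration of the index arithmetic, here is a sketch on a made-up 4-row matrix. The names `m`, `k`, `n` and `picked` and all the values are assumptions for the example; the last column of `m` plays the role of column 31.

```r
# Toy stand-in for the real data: 3 value columns plus a selector column.
m <- cbind(c(11, 21, 31, 41),
           c(12, 22, 32, 42),
           c(13, 23, 33, 43),
           c(3, 1, 2, 3))      # column to pick in each row
k <- ncol(m) - 1               # number of value columns
n <- nrow(m)

# After transposing, row r of m occupies positions (r-1)*k + 1 ... (r-1)*k + k
# of the underlying vector, so the wanted element sits at (r-1)*k + selector.
picked <- t(m[, 1:k])[k * (0:(n - 1)) + m[, k + 1]]
picked
```

Using `nrow()` and `ncol()` rather than the hard-coded 30 and 399999 should keep the expression valid if the matrix dimensions change.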

冰火雁神 2024-09-15 14:37:58

Single-index notation for the matrix may use less memory. This would involve doing something like:

i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]

Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.

## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
##             column ID          row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))

Using an index matrix is an alternative that will probably use more memory but is slightly clearer:

ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
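
Applied to the layout in the question, a minimal sketch could look like the following. The matrix `data` is simulated here with a reduced row count, purely as an assumption for illustration, and the names `vals`, `idx` and `a_loop` are made up; only the two lines that build `idx` and `a` correspond to the real task.

```r
set.seed(1)                                  # reproducible toy data
n <- 1000                                    # far fewer rows than the real 400 000
vals <- matrix(rpois(30 * n, 50), n, 30)     # 30 value columns
data <- cbind(vals, sample(1:30, n, replace = TRUE))  # column 31 = selector

# Two-column index matrix: one (row number, column number) pair per row.
idx <- cbind(seq_len(n), data[, 31])
a <- data[idx]

# Cross-check against the original row-by-row loop.
a_loop <- numeric(n)
for (r in seq_len(n)) a_loop[r] <- data[r, data[r, 31]]
all(a == a_loop)
```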

凉栀 2024-09-15 14:37:58

Try to change the code to work a column at a time:

M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
    a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}

For each column i, this sets the elements of a whose last-column value is i to the corresponding values from column i, so the loop runs 30 times instead of 400 000 times. It took longer to build the matrix than to calculate vector a.
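
A sketch of the same column-at-a-time idea on a smaller simulated matrix, with the selector column extracted once and the result checked against the per-row maxima. The reduced size and the names `sel` and `hit` are assumptions for the example.

```r
set.seed(42)
M <- matrix(rpois(30 * 1000, 50), 1000, 30)
MM <- cbind(M, apply(M, 1, which.max))   # last column = index of the row maximum
sel <- MM[, ncol(MM)]                    # extract the selector column once
a <- numeric(nrow(MM))
for (j in seq_len(ncol(MM) - 1)) {
  hit <- sel == j                        # rows whose selector points at column j
  a[hit] <- MM[hit, j]
}
all(a == apply(M, 1, max))
```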
