Applying multiple functions to each row of a data frame
Every time I think I understand working with vectors, what appears to be a simple problem turns my head inside out. Lots of reading and trying different examples hasn't helped on this occasion. Please spoon-feed me here...
I want to apply two custom functions to each row of a data frame and add the results as two new columns. Here is my sample code:
# Required packages:
library(plyr)

FindMFE <- function(x) {
  MFE <- max(x, na.rm = TRUE)
  MFE <- ifelse(is.infinite(MFE) | (MFE < 0), 0, MFE)
  return(MFE)
}

FindMAE <- function(x) {
  MAE <- min(x, na.rm = TRUE)
  MAE <- ifelse(is.infinite(MAE) | (MAE > 0), 0, MAE)
  return(MAE)
}

FindMAEandMFE <- function(x) {
  # I know this next line is wrong...
  z <- apply(x, 1, FindMFE, FindMFE)
  return(z)
}

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))
df1 = transform(df1,
  FindMAEandMFE(df1)
)

# df1 should end up with the following data...
# Bar1 Bar2 MFE MAE
#    1    3   3   0
#    2    1   2   0
#    3    3   3   0
#   -3   -2   0  -3
#   -2   -3   0  -3
#   -1   -1   0  -1
It would be great to get an answer using the plyr library and a more base-R-like approach. Both will aid my understanding. Of course, please point out where I'm going wrong if it's obvious. ;-)
Now back to the help files for me!
Edit: I would like a multivariate solution as column names may change and expand over time. It also allows re-use of the code in future.
I think you are overcomplicating this. What is wrong with two separate apply() calls? There is, however, a far better way to do what you are doing here that involves no looping/apply calls. I'll deal with these separately, but the second solution is preferable as it is truly vectorised.

Two apply calls version

First, two separate apply() calls using only base R functions: apply(df1, 1, FindMFE) for the MFE column and apply(df1, 1, FindMAE) for the MAE column. Adding both results to df1 gives the data frame you asked for.
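A minimal sketch of the two-call version, assuming the FindMFE and FindMAE helpers and the df1 data from the question (repeated here so the snippet runs on its own). Using transform() to attach the new columns is one option among several:

```r
# Helpers as defined in the question
FindMFE <- function(x) {
  MFE <- max(x, na.rm = TRUE)
  ifelse(is.infinite(MFE) | (MFE < 0), 0, MFE)
}
FindMAE <- function(x) {
  MAE <- min(x, na.rm = TRUE)
  ifelse(is.infinite(MAE) | (MAE > 0), 0, MAE)
}

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))

# Each apply() walks the rows once and returns one value per row;
# both calls see the original two-column df1
df1 <- transform(df1,
                 MFE = apply(df1, 1, FindMFE),
                 MAE = apply(df1, 1, FindMAE))
print(df1)
```

Because apply() works over whatever columns it is given, the same call keeps working if more Bar columns are added later.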
OK, looping over the rows of df1 twice is perhaps a little inefficient, but even for big problems you have already spent more time thinking about how to do this cleverly in a single pass than you would save by doing it that way.

Using vectorised functions pmax() and pmin()
So a better way of doing this is to note the pmax() and pmin() functions and realise that they can do what each of the apply(df1, 1, FindFOO) calls was doing. For example, pmax(0, Bar1, Bar2, na.rm = TRUE), evaluated on the columns of df1, would be MFE from your question. This is very simple to work with if you have two columns and they are always Bar1 and Bar2, or always the first 2 columns of df1. But it is not very general; what if you have multiple columns you want to compute this over? pmax(df1[, 1:2], na.rm = TRUE) won't do what we want, because pmax() compares its separate arguments element by element and here the whole data frame arrives as a single argument.

The trick to getting a general solution using pmax() and pmin() is to use do.call() to arrange the calls to those two functions for us, so that each column of the data frame is passed as a separate argument. Updating your functions to use this idea gives the same MFE and MAE values as before, and not an apply() in sight. If you want to do this in a single step, this is now much easier to wrap: a small function can cbind() the two results, and its output can then be bound onto df1.
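A sketch of how the do.call() idea might look. The names FindMFE2, FindMAE2 and FindMAEandMFE2 are illustrative, and the bodies are one plausible reading of the description above, not necessarily the exact code the answer had in mind:

```r
df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))

# Illustrative names; each column of x becomes a separate argument to pmax()/pmin(),
# and the extra 0 is recycled so the result is clamped at zero
FindMFE2 <- function(x) {
  MFE <- do.call(pmax, c(as.list(x), 0, na.rm = TRUE))
  MFE[is.infinite(MFE)] <- 0
  MFE
}
FindMAE2 <- function(x) {
  MAE <- do.call(pmin, c(as.list(x), 0, na.rm = TRUE))
  MAE[is.infinite(MAE)] <- 0
  MAE
}

# Single-step wrapper: returns a two-column matrix, one row per row of x
FindMAEandMFE2 <- function(x) {
  cbind(MFE = FindMFE2(x), MAE = FindMAE2(x))
}

res <- cbind(df1, FindMAEandMFE2(df1))
print(res)
```

Nothing in these functions refers to Bar1 or Bar2 by name, so they should cope with any number of numeric columns.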
I show three alternative one-liners:
1. Using the each function of plyr
2. Using the plyr each function with base R
3. Using the pmin and pmax functions, which are vectorised

Solution 1: plyr and each
The plyr package defines the each function, which does what you want. From ?each: "Aggregate multiple functions into a single function." This means you can solve your problem using a one-liner that hands the combined function to plyr.

Solution 2: each and base R

You can, of course, use each with base functions. You can use it with apply; just note that you have to transpose the results before adding them to your original data.frame.

Solution 3: using vectorised functions
Using the vectorised functions pmin and pmax, you can compute both columns in a single one-liner.
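The three one-liners described in this answer might look like the sketch below. The choice of adply() as the row-wise plyr driver, the anonymous functions passed to each(), and the names both, sol1, sol2 and sol3 are assumptions for illustration. The plyr package must be installed for the first two.

```r
library(plyr)

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))

# each() combines several functions into one that returns a named vector;
# including 0 in max()/min() clamps the result at zero
both <- each(MFE = function(x) max(x, 0), MAE = function(x) min(x, 0))

# Solution 1: plyr only, applying `both` to every row of df1
sol1 <- adply(df1, 1, both)

# Solution 2: base apply(); the result has one column per row of df1,
# so it is transposed before being bound to the original data
sol2 <- cbind(df1, t(apply(df1, 1, both)))

# Solution 3: vectorised pmax()/pmin(), no row-wise iteration at all
sol3 <- transform(df1, MFE = pmax(0, Bar1, Bar2), MAE = pmin(0, Bar1, Bar2))

print(sol3)
```

Solution 3 names the columns explicitly, so it is the least general of the three if the set of columns is expected to change.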
There are lots of good answers here. I started this while Gavin Simpson was editing, so we cover some similar ground. What the parallel min and max functions (pmin and pmax) do is pretty much exactly what you're writing your functions for. It may be a little opaque what the 0 does in pmax(0, Bar1, Bar2), but essentially the 0 gets recycled, so it's like passing a vector of zeros that is as long as Bar1 and Bar2.
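A small sketch of that recycling, using the column values from the question as plain vectors (the names short and long are only for illustration):

```r
Bar1 <- c(1, 2, 3, -3, -2, -1)
Bar2 <- c(3, 1, 3, -2, -3, -1)

# A single 0 is recycled to the length of the other arguments...
short <- pmax(0, Bar1, Bar2)

# ...so it behaves like an explicit vector of zeros
long <- pmax(rep(0, length(Bar1)), Bar1, Bar2)

print(short)
print(identical(short, long))
```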
That will take each item of the three things passed and find the max of them. So the result will be 0 wherever the row maximum was negative, which accomplishes much of what your ifelse statement did. You could rewrite your functions so that they work on whole vectors and combine things in a way similar to what you were doing, which might make it a bit more transparent. In this case we'd just pass the data frame to a new parallel and fast findMFE function that works with any numeric data frame and returns a vector.
What this function does is add an extra column of 0s to the passed data frame and then call pmax passing each separate column of df1 as if it were a list (dataframes are lists so this is easy).
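A sketch of such a findMFE, following the description above. The column name zeroes is an assumption; any name that does not clash with an existing column would do:

```r
findMFE <- function(dataf) {
  # Extra column of 0s (the single 0 is recycled to the number of rows)
  dataf$zeroes <- 0
  # A data frame is a list of columns, so do.call() passes
  # each column to pmax() as a separate argument
  do.call(pmax, dataf)
}

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))
print(findMFE(df1))
```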
Now, I note that you actually want to correct for Inf values in your data that aren't in your example... we could add an extra line to your function...
Now, that's proper use of the ifelse() function on a vector. I did it that way as an example for you but Gavin Simpson's use of MFE[is.infinite(MFE)] <- 0 is more efficient. Note that this findMFE function isn't used in a loop, it's just passed the whole data frame.
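The two ways of zeroing out infinite values mentioned here can be sketched side by side. The data frame df_inf and the function name findMFE_idx are made up for illustration, since the question's sample data contains no Inf:

```r
# Variant with a vectorised ifelse() applied once to the whole result
findMFE <- function(dataf) {
  dataf$zeroes <- 0
  MFE <- do.call(pmax, dataf)
  ifelse(is.infinite(MFE), 0, MFE)
}

# Variant with logical indexing, as in Gavin Simpson's answer
findMFE_idx <- function(dataf) {
  dataf$zeroes <- 0
  MFE <- do.call(pmax, dataf)
  MFE[is.infinite(MFE)] <- 0
  MFE
}

# Hypothetical data with an infinite value in the second row
df_inf <- data.frame(Bar1 = c(1, Inf, -3), Bar2 = c(3, 1, -2))
print(findMFE(df_inf))
print(findMFE_idx(df_inf))
```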
The comparable findMAE uses pmin in place of pmax, and the combined function simply returns both results side by side, so that it can be used like this:
MFEandMAE <- findMFEandMAE(df1)
df1 <- cbind(df1, MFEandMAE)
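For the two lines above to run, findMFE, findMAE and findMFEandMAE need to exist. A sketch of what they might look like, assuming findMAE mirrors findMFE with pmin and the combined function returns a data frame:

```r
findMFE <- function(dataf) {
  dataf$zeroes <- 0
  MFE <- do.call(pmax, dataf)
  ifelse(is.infinite(MFE), 0, MFE)
}

findMAE <- function(dataf) {
  dataf$zeroes <- 0
  MAE <- do.call(pmin, dataf)
  ifelse(is.infinite(MAE), 0, MAE)
}

# Both results as columns of one data frame
findMFEandMAE <- function(dataf) {
  data.frame(MFE = findMFE(dataf), MAE = findMAE(dataf))
}

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))
MFEandMAE <- findMFEandMAE(df1)
df1 <- cbind(df1, MFEandMAE)
print(df1)
```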
Some tips
If you've got a scalar condition, don't use ifelse(); use if () else. It's much faster in scalar situations. Your functions are scalar, and you're trying to vectorize them. ifelse() is already vectorized and runs very fast when used that way, but it is much slower than if () else when used on a scalar.
Also, if you're going to be putting stuff in a loop or apply statement put as little in there as possible. For example, in your case the ifelse() really needed to be taken out of the loop and applied to the whole MFE result afterwards.
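Both tips can be sketched on the question's data. The helper name clampScalar is hypothetical and only there to show the scalar form:

```r
df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))

# Keep the per-row work minimal: only the row maximum is computed inside apply()
MFE <- apply(df1, 1, max, na.rm = TRUE)

# The clean-up then runs once on the whole vector, where ifelse() is at its best
MFE <- ifelse(is.infinite(MFE) | (MFE < 0), 0, MFE)
print(MFE)

# If a check really is scalar, plain if/else is the better tool
clampScalar <- function(v) if (is.infinite(v) || v < 0) 0 else v
print(clampScalar(-2))
```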
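If you really, really want it, you can write FindMAEandMFE as a single apply() over the rows that returns both values for each row, and transpose the result (not tested; it should return an array with two (named, I think) columns and as many rows as the data.frame had). Now you can cbind() that array onto df1.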
Very icky. Please heed Gavin's advice.
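For completeness, a sketch of what this single-apply() version might look like, assuming the FindMFE and FindMAE helpers from the question. The argument name currow is illustrative:

```r
# Helpers as defined in the question
FindMFE <- function(x) {
  MFE <- max(x, na.rm = TRUE)
  ifelse(is.infinite(MFE) | (MFE < 0), 0, MFE)
}
FindMAE <- function(x) {
  MAE <- min(x, na.rm = TRUE)
  ifelse(is.infinite(MAE) | (MAE > 0), 0, MAE)
}

# One pass over the rows; each row yields a named pair, so apply() returns
# a 2 x n matrix that is transposed into n rows and two named columns
FindMAEandMFE <- function(x) {
  t(apply(x, 1, function(currow) c(MFE = FindMFE(currow), MAE = FindMAE(currow))))
}

df1 <- data.frame(Bar1 = c(1, 2, 3, -3, -2, -1), Bar2 = c(3, 1, 3, -2, -3, -1))
df1 <- cbind(df1, FindMAEandMFE(df1))
print(df1)
```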