Passing a data.frame column name to a function

Posted on 2024-08-29 06:52:40

I'm trying to write a function to accept a data.frame (x) and a column from it. The function performs some calculations on x and later returns another data.frame. I'm stuck on the best-practices method to pass the column name to the function.

The two minimal examples fun1 and fun2 below produce the desired result, being able to perform operations on x$column, using max() as an example. However, both rely on the seemingly (at least to me) inelegant

  1. call to substitute() and possibly eval()
  2. the need to pass the column name as a character vector.

fun1 <- function(x, column){
  do.call("max", list(substitute(x[a], list(a = column))))
}

fun2 <- function(x, column){
  max(eval((substitute(x[a], list(a = column)))))
}

df <- data.frame(B = rnorm(10))
fun1(df, "B")
fun2(df, "B")

I would like to be able to call the function as fun(df, B), for example. Other options I have considered but have not tried:

  • Pass column as an integer of the column number. I think this would avoid substitute(). Ideally, the function could accept either.
  • with(x, get(column)), but, even if it works, I think this would still require substitute
  • Make use of formula() and match.call(), neither of which I have much experience with.

Subquestion: Is do.call() preferred over eval()?

Comments (8)

人间☆小暴躁 2024-09-05 06:52:40

This answer will cover many of the same elements as existing answers, but this issue (passing column names to functions) comes up often enough that I wanted there to be an answer that covered things a little more comprehensively.

Suppose we have a very simple data frame:

dat <- data.frame(x = 1:4,
                  y = 5:8)

and we'd like to write a function that creates a new column z that is the sum of columns x and y.

A very common stumbling block here is that a natural (but incorrect) attempt often looks like this:

foo <- function(df,col_name,col1,col2){
      df$col_name <- df$col1 + df$col2
      df
}

#Call foo() like this:    
foo(dat,z,x,y)

The problem here is that df$col1 doesn't evaluate the expression col1. It simply looks for a column in df literally called col1. This behavior is described in ?Extract under the section "Recursive (list-like) Objects".
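A quick console check makes the difference visible. This is only a minimal sketch, and the variable name col1 is chosen purely for illustration:

```r
dat <- data.frame(x = 1:4, y = 5:8)
col1 <- "x"

# `$` treats col1 as a literal column name; dat has no column called "col1"
by_dollar <- dat$col1

# `[[` evaluates col1 first and then looks up the column "x"
by_bracket <- dat[[col1]]

print(by_dollar)   # NULL
print(by_bracket)  # [1] 1 2 3 4
```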

The simplest and most often recommended solution is simply to switch from $ to [[ and pass the function arguments as strings:

new_column1 <- function(df,col_name,col1,col2){
    #Create new column col_name as sum of col1 and col2
    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column1(dat,"z","x","y")
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

This is often considered "best practice" since it is the method that is hardest to screw up. Passing the column names as strings is about as unambiguous as you can get.

The following two options are more advanced. Many popular packages make use of these kinds of techniques, but using them well requires more care and skill, as they can introduce subtle complexities and unanticipated points of failure. Hadley Wickham's Advanced R book is an excellent reference for some of these issues.

If you really want to save the user from typing all those quotes, one option might be to convert bare, unquoted column names to strings using deparse(substitute()):

new_column2 <- function(df,col_name,col1,col2){
    col_name <- deparse(substitute(col_name))
    col1 <- deparse(substitute(col1))
    col2 <- deparse(substitute(col2))

    df[[col_name]] <- df[[col1]] + df[[col2]]
    df
}

> new_column2(dat,z,x,y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12

Frankly, this is probably a bit silly, since we're really doing the same thing as in new_column1, just with a bunch of extra work to convert bare names to strings.
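One of the subtle failure points is that deparse(substitute()) captures whatever was typed, not what it refers to. The following is a minimal sketch using a hypothetical helper, name_of(), which is not part of the answer's code:

```r
# Hypothetical helper: returns the text of the argument exactly as the caller typed it
name_of <- function(col) deparse(substitute(col))

typed_directly <- name_of(x)      # "x"

# A variable holding the column name is NOT looked up; its own name is captured
target <- "x"
via_variable <- name_of(target)   # "target", not "x"
```

So a call such as new_column2(dat, z, target, y) would look for a column literally named target, which is why the string-based new_column1 is easier to use programmatically.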

Finally, if we want to get really fancy, we might decide that rather than passing in the names of two columns to add, we'd like to be more flexible and allow for other combinations of two variables. In that case we'd likely resort to using eval() on an expression involving the two columns:

new_column3 <- function(df,col_name,expr){
    col_name <- deparse(substitute(col_name))
    df[[col_name]] <- eval(substitute(expr),df,parent.frame())
    df
}

Just for fun, I'm still using deparse(substitute()) for the name of the new column. Here, all of the following will work:

> new_column3(dat,z,x+y)
  x y  z
1 1 5  6
2 2 6  8
3 3 7 10
4 4 8 12
> new_column3(dat,z,x-y)
  x y  z
1 1 5 -4
2 2 6 -4
3 3 7 -4
4 4 8 -4
> new_column3(dat,z,x*y)
  x y  z
1 1 5  5
2 2 6 12
3 3 7 21
4 4 8 32

So the short answer is basically: pass data.frame column names as strings and use [[ to select single columns. Only start delving into eval, substitute, etc. if you really know what you're doing.
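The third argument to eval() in new_column3, parent.frame(), is what lets the expression mix columns with ordinary variables: names that are not columns of df are looked up in the caller's environment. A small sketch of that lookup order, where the variable k is an illustrative assumption:

```r
new_column3 <- function(df, col_name, expr) {
  col_name <- deparse(substitute(col_name))
  # Look up names in df first, then in the environment new_column3 was called from
  df[[col_name]] <- eval(substitute(expr), df, parent.frame())
  df
}

dat <- data.frame(x = 1:4, y = 5:8)
k <- 10  # lives in the calling environment, not in dat

res <- new_column3(dat, z, x * k)
print(res$z)  # [1] 10 20 30 40
```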

寂寞美少年 2024-09-05 06:52:40

You can just use the column name directly:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[,column])
}
fun1(df, "B")
fun1(df, c("B","A"))

There's no need to use substitute, eval, etc.

You can even pass the desired function as a parameter:

fun1 <- function(x, column, fn) {
  fn(x[,column])
}
fun1(df, "B", max)

Alternatively, using [[ also works for selecting a single column at a time:

df <- data.frame(A=1:10, B=2:11, C=3:12)
fun1 <- function(x, column){
  max(x[[column]])
}
fun1(df, "B")
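Because [[ accepts either a column name or a column position, the same function also covers the case raised in the question of passing the column as an integer. A minimal sketch (col_max is an illustrative name, not from the answer):

```r
col_max <- function(x, column) {
  # column may be a single name (character) or a single position (numeric)
  stopifnot(is.data.frame(x), length(column) == 1)
  max(x[[column]])
}

df <- data.frame(A = 1:10, B = 2:11, C = 3:12)
by_name <- col_max(df, "B")    # 11
by_position <- col_max(df, 2)  # 11
```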

来日方长 2024-09-05 06:52:40

Personally I think that passing the column as a string is pretty ugly. I like to do something like:

get.max <- function(column,data=NULL){
    column<-eval(substitute(column),data, parent.frame())
    max(column)
}

which will yield:

> get.max(mpg,mtcars)
[1] 33.9
> get.max(c(1,2,3,4,5))
[1] 5

Notice how the specification of a data.frame is optional. You can even work with functions of your columns:

> get.max(1/mpg,mtcars)
[1] 0.09615385
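One caveat with this style is worth knowing before adopting it: substitute() sees only what was typed in the immediate call, so the function is hard to call from inside another function. A minimal sketch, where wrapper() and the data frame d are illustrative assumptions:

```r
get.max <- function(column, data = NULL) {
  column <- eval(substitute(column), data, parent.frame())
  max(column)
}

d <- data.frame(height_cm = c(150, 180, 165))

# Works: height_cm is found as a column of d
direct <- get.max(height_cm, d)

# Inside wrapper(), get.max() captures the symbol `col`, not `height_cm`.
# `col` is not a column of d, so R ends up evaluating height_cm outside d and fails.
wrapper <- function(col, data) get.max(col, data)
indirect <- tryCatch(wrapper(height_cm, d), error = function(e) "error")
```

Passing the column name as a string and using [[ avoids this problem, since a string can be handed from one function to the next unchanged.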

盛夏已如深秋| 2024-09-05 06:52:40

With dplyr it's now also possible to access a specific column of a dataframe by simply using double curly braces {{...}} around the desired column name within the function body, e.g. for col_name:

library(tidyverse)

fun <- function(df, col_name){
   df %>% 
     filter({{col_name}} == "test_string")
} 

南风起 2024-09-05 06:52:40

Another way is to use the tidy evaluation approach. It is pretty straightforward to pass columns of a data frame either as strings or as bare column names. See the tidy evaluation documentation for more.

library(rlang)
library(tidyverse)

set.seed(123)
df <- data.frame(B = rnorm(10), D = rnorm(10))

Use column names as strings

fun3 <- function(x, ...) {
  # capture strings and convert them to symbols
  dots <- ensyms(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun3(df, "B")
#>          B
#> 1 1.715065

fun3(df, "B", "D")
#>          B        D
#> 1 1.715065 1.786913

Use bare column names

fun4 <- function(x, ...) {
  # capture expressions and create quosures
  dots <- enquos(...)
  # unquote to evaluate inside dplyr verbs
  summarise_at(x, vars(!!!dots), list(~ max(., na.rm = TRUE)))
}

fun4(df, B)
#>          B
#> 1 1.715065

fun4(df, B, D)
#>          B        D
#> 1 1.715065 1.786913

Created on 2019-03-01 by the reprex package (v0.2.1.9000)

花桑 2024-09-05 06:52:40

Tung's answer and mgrund's answer presented tidy evaluation. In this answer I'll show how we can use these concepts to do something similar to joran's answer (specifically his function new_column3). The objective is to make it easier to see the differences between base evaluation and tidy evaluation, and also to see the different syntaxes that can be used in tidy evaluation. You will need rlang and dplyr for this.

Using base evaluation tools (joran's answer):

new_column3 <- function(df,col_name,expr){
  col_name <- deparse(substitute(col_name))
  df[[col_name]] <- eval(substitute(expr),df,parent.frame())
  df
}

In the first line, substitute captures col_name as an unevaluated expression, more specifically a symbol (also sometimes called a name), rather than evaluating it as an object. rlang's replacements for substitute can be:

  • ensym - turns it into a symbol;
  • enexpr - turns it into an expression;
  • enquo - turns it into a quosure, an expression that also points to the environment where R should look for the variables to evaluate it.

Most of the time, you want to have that pointer to the environment. When you don't specifically need it, having it rarely causes problems. Thus, most of the time you can use enquo. In this case, you can use ensym to make the code easier to read, as it makes it clearer what col_name is.

Also in the first line, deparse is turning the expression/symbol into a string. You could also use as.character or rlang::as_string.

In the second line, the substitute is turning expr into a 'full' expression (not a symbol), so ensym is not an option anymore.
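The distinction between a symbol and a 'full' expression can be checked in base R. This is a minimal sketch using a hypothetical helper, capture(), that simply returns what substitute() sees:

```r
# Hypothetical helper: return the unevaluated argument
capture <- function(arg) substitute(arg)

sym <- capture(z)      # a bare name
cl  <- capture(x + y)  # an expression built from a function call

is.symbol(sym)  # TRUE
is.symbol(cl)   # FALSE
is.call(cl)     # TRUE

# Both can be turned back into text
name_text <- deparse(sym)  # "z"
expr_text <- deparse(cl)   # "x + y"
```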

Also in the second line, we can now change eval to rlang::eval_tidy. eval would still work with enexpr, but not with a quosure. When you have a quosure, you don't need to pass the environment to the evaluation function (as joran did with parent.frame()).

One combination of the substitutions suggested above might be:

new_column3 <- function(df,col_name,expr){
  col_name <- as_string(ensym(col_name))
  df[[col_name]] <- eval_tidy(enquo(expr), df)
  df
}

We can also use the dplyr verbs, which allow for data-masking (evaluating a column in a data frame as a variable, calling it by its name). Instead of converting the symbol to a string and subsetting df with [[, we can use mutate:

new_column3 <- function(df,col_name,expr){
  col_name <- ensym(col_name)
  df %>% mutate(!!col_name := eval_tidy(enquo(expr), df))
}

To avoid the new column being named "col_name", we force its early evaluation (as opposed to lazy evaluation, R's default) with the bang-bang !! operator. Because the left-hand side is now an operation rather than a plain name, we can't use the 'normal' =, and must use the new syntax :=.

The common operation of turning a column name into a symbol and then unquoting it with bang-bang has a shortcut: the curly-curly {{ operator:

new_column3 <- function(df,col_name,expr){
  df %>% mutate({{col_name}} := eval_tidy(enquo(expr), df))
}

I'm not an expert in evaluation in R and might have done an over simplification, or used a wrong term, so please correct me in the comments. I hope to have helped in comparing the different tools used in the answers to this question.

遗心遗梦遗幸福 2024-09-05 06:52:40

As an extra thought, if you need to pass the column name unquoted to the custom function, perhaps match.call() could be useful as well, as an alternative to deparse(substitute()):

df <- data.frame(A = 1:10, B = 2:11)

fun <- function(x, column){
  arg <- match.call()
  max(x[[arg$column]])
}

fun(df, A)
#> [1] 10

fun(df, B)
#> [1] 11

If there is a typo in the column name, it would be safer to stop with an error:

fun <- function(x, column) max(x[[match.call()$column]])
fun(df, typo)
#> Warning in max(x[[match.call()$column]]): no non-missing arguments to max;
#> returning -Inf
#> [1] -Inf

# Stop with error in case of typo
fun <- function(x, column){
  arg <- match.call()
  if (is.null(x[[arg$column]])) stop("Wrong column name")
  max(x[[arg$column]])
}

fun(df, typo)
#> Error in fun(df, typo): Wrong column name
fun(df, A)
#> [1] 10

Created on 2019-01-11 by the reprex package (v0.2.1)

I don't think I would use this approach, since it involves more typing and complexity than just passing the quoted column name as pointed out in the answers above, but it is an option.
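A variant of the same idea converts the captured argument to a string first, so that the check and the lookup both work on a plain column name. This is a sketch rather than the answer's own code, and max_col is an illustrative name:

```r
max_col <- function(x, column) {
  # match.call()$column is the argument as typed: a symbol for A, a string for "A"
  col <- as.character(match.call()$column)
  if (length(col) != 1 || !col %in% names(x)) {
    stop("Wrong column name")
  }
  max(x[[col]])
}

df <- data.frame(A = 1:10, B = 2:11)
bare <- max_col(df, A)      # 10
quoted <- max_col(df, "B")  # 11

# A misspelled column stops with an error instead of returning -Inf
typo_msg <- tryCatch(max_col(df, typo), error = function(e) conditionMessage(e))
```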

猫性小仙女 2024-09-05 06:52:40

If you are trying to build this function within an R package or simply want to reduce complexity, you can do the following:

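test_func <- function(df, column) {
  # with = FALSE is a data.table argument, so df is assumed to be a data.table;
  # on a plain data.frame, use max(df[[column]]) instead.
  if (column %in% colnames(df)) {
    return(max(df[, column, with = FALSE]))
  } else {
    # Pass the message parts straight to stop(); cat() would print them and return NULL.
    stop(column, " not in data.frame columns.")
  }
}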

The argument with = FALSE comes from data.table: it "disables the ability to refer to columns as if they are variables, thereby restoring the “data.frame mode”" (per the data.table documentation on CRAN). The if statement is a quick way to check whether the column name provided is in the data.frame. You could also use tryCatch error handling here.
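The tryCatch alternative could look roughly like this for a plain data.frame. It is a sketch under assumptions: safe_max is an illustrative name, and returning NA on failure is one possible policy, not the only one:

```r
safe_max <- function(df, column) {
  tryCatch(
    {
      if (!column %in% colnames(df)) {
        stop(column, " not in data.frame columns.")
      }
      max(df[[column]])
    },
    error = function(e) {
      # Report the problem and return NA instead of aborting
      message("safe_max: ", conditionMessage(e))
      NA
    }
  )
}

df <- data.frame(A = 1:10, B = 2:11)
ok <- safe_max(df, "A")           # 10
missing_col <- safe_max(df, "Z")  # NA, with a message
```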
