How do I use a variable name to refer to a data frame column with ddply?
I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr.
When I use the name of the column directly with ddply and cumsum I have no problems:
    require(plyr)
    df <- data.frame(date = seq(as.Date("2007/1/1"),
                                by = "month",
                                length.out = 60),
                     sales = runif(60, min = 700, max = 1200))
    df$year <- as.numeric(format(as.Date(df$date), format="%Y"))
    df <- ddply(df, .(year), transform,
                cum_sales = (cumsum(as.numeric(sales))))
This is all well and good but the ultimate aim is to be able to pass a column name to this function. When I try to use a variable in place of the column name, it doesn't work as I expected:
    mycol <- "sales"
    df[mycol]
    df <- ddply(df, .(year), transform,
                cum_value2 = cumsum(as.numeric(df[mycol])))
I thought I knew how to access columns by name. This worries me because it suggests that I have failed to understand something basic about indexing and extraction. I would have thought that referring to columns by name in this way would be a common need.
I have two questions.
- What am I doing wrong i.e. what have I misunderstood?
- Is there a better way of going about this, bearing in mind that the names of the columns will not be known beforehand by the function?
TIA
Comments (2)
The arguments to ddply are expressions that are evaluated in the context of each piece the original data frame is split into. Your df[mycol] addresses the whole data frame, so you cannot pass it as-is. (By the way, why do you need the as.numeric(as.character()) calls? They are completely useless.)
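To make the point about df[mycol] concrete, here is a small base R sketch. The toy data frame and its values are made up for illustration; only the column names year and sales and the variable mycol come from the question.

```r
# Toy data (hypothetical values): two years, three rows each
df <- data.frame(year  = rep(2007:2008, each = 3),
                 sales = c(10, 20, 30, 40, 50, 60))
mycol <- "sales"

# Single brackets return a one-column data frame, not a vector
class(df[mycol])      # "data.frame"

# Double brackets extract the column itself
class(df[[mycol]])    # "numeric"

# as.numeric() cannot coerce a multi-row data frame (a list) to a vector,
# so this call fails; tryCatch() turns the failure into a marker value
res <- tryCatch(as.numeric(df[mycol]), error = function(e) "error")

# Even the correct extraction covers every row of df,
# not just the rows of the year currently being processed
length(df[[mycol]])            # 6
nrow(df[df$year == 2007, ])    # 3
```

So there are two separate issues: `[` versus `[[`, and the fact that `df` inside the ddply call still means the full data frame rather than the per-year piece.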
The easiest way is to write your own function that does everything internally and passes the column name down.
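One way such a function might look is sketched below. This is not the answerer's original code: the function name add_running_total and its arguments group and new_col are hypothetical names chosen for illustration. It assumes plyr is installed and relies on ddply accepting a character vector of grouping variables and a function that is applied to each piece.

```r
library(plyr)

# Hypothetical helper: adds a per-group running total of column `col`.
# `col`, `group` and `new_col` are all passed as character strings.
add_running_total <- function(data, col, group = "year",
                              new_col = paste0("cum_", col)) {
  ddply(data, group, function(chunk) {
    # `chunk` holds only the rows for one group; [[ ]] extracts the
    # column as a plain vector, so cumsum() sees just that group's values
    chunk[[new_col]] <- cumsum(chunk[[col]])
    chunk
  })
}

# Same sample data as in the question
df <- data.frame(date = seq(as.Date("2007/1/1"),
                            by = "month",
                            length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(df$date, format = "%Y"))

out <- add_running_total(df, "sales")
head(out)
```

Because the anonymous function receives each piece as an ordinary data frame, the column name can stay a string all the way down and no expression building is needed.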
The problem is that ddply expects its last arguments to be expressions, which are evaluated on chunks of the data.frame (one per year, in your example).
If you use df[mycol], you get the whole data.frame, not the annual chunks.
The following works, but is not very elegant: I build the expression as a string, and then convert it with eval(parse(...)).
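A minimal sketch of that string-building approach, reconstructed from the description rather than taken from the answer itself. The names expr_text and result are hypothetical, and it assumes the column name is a syntactically valid R name (otherwise it would need backticks).

```r
library(plyr)

# Same sample data as in the question
df <- data.frame(date = seq(as.Date("2007/1/1"),
                            by = "month",
                            length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(df$date, format = "%Y"))

mycol <- "sales"

# Paste the column name into the text of the ddply call ...
expr_text <- sprintf(
  "ddply(df, .(year), transform, cum_value2 = cumsum(%s))",
  mycol
)

# ... then parse the text into an expression and evaluate it
result <- eval(parse(text = expr_text))
head(result)
```

This behaves as if the column name had been typed directly into the call, which is why it works, but building code from strings is harder to debug than passing the name to a function that indexes each piece.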