Apply a function to a data.frame and mutate using the same columns from another data.frame
I have two data frames with spectral bands from a satellite, `redDF` and `nirDF`. Both data frames have one value column per date. The column names start with an 'X' and correspond between the two data frames.
I want to get a new data frame in which, for each column starting with an 'X' in both `redDF` and `nirDF`, a new value is calculated according to some formula.
Here is a data sample:
library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
as.Date("2015-11-09", "%Y-%m-%d"),
by="16 days") %>%
format(., "%Y-%m-%d") %>%
paste0("X", .)
# sample data values
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices
# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
mydata[sample(1:length(mydata))],
mydata[sample(1:length(mydata))]) %>%
t() %>%
as.data.frame(., row.names = FALSE) %>%
rename_with(~datecolnames) %>%
mutate(id = row_number()+1142) %>%
select(id, everything())
# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>%
mutate(across(-id,~as.integer(.x+20*1.8))) %>%
select(id, everything())
> nirDF
id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143 NA 645 NA 636 569 841 706
2 1144 1025 NA 706 569 354 NA NA
3 1145 904 636 706 645 NA NA 115
X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1 115 1025 904 NA 409 354
2 115 636 409 645 841 904
3 569 409 354 841 1025 NA
And this is the formula:
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
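As a quick check of what the formula does on plain vectors, here is a small sketch. The input values are made up for illustration and are not taken from the data above.

```r
getNDVI <- function(red, nir) {
  round((nir - red) / (nir + red), digits = 4)
}

# Hypothetical band values; the third red value is missing
red <- c(600L, 989L, NA)
nir <- c(636L, 1025L, 904L)

ndvi <- getNDVI(red, nir)
ndvi
# 36 / 1236 and 36 / 2014 rounded to four digits, then NA:
# 0.0291 0.0179 NA
```

A missing value in either band simply propagates to the result, so no special NA handling should be needed in the formula itself.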
I hoped I would be able to do something like:
ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))
But that doesn't work, as `dplyr` doesn't know what the `nir` argument of `getNDVI` should be. I have seen solutions for accessing other data frames in `mutate()` by using the `$COLNAME` indexer, but since I have 197 columns, that is not an option here.
Comments (2)
I would approach this with a for loop, though I know it does not make best use of functionality like `across`.
First we create a list of the columns we want to iterate over:
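A minimal sketch of this step, assuming the columns of interest are exactly those whose names start with "X". The toy `redDF` and the name `date_cols` are stand-ins chosen for illustration.

```r
# Small stand-in for redDF with two date columns (illustrative values)
redDF <- data.frame(id = 1143:1145,
                    `X2015-05-01` = c(NA, 989L, 868L),
                    `X2015-05-17` = c(609L, NA, 600L),
                    check.names = FALSE)

# Names of all date columns, i.e. every column whose name starts with "X"
date_cols <- names(redDF)[startsWith(names(redDF), "X")]
date_cols
```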
Then we join on `id` and ensure columns are named according to source dataset:
So `joined_df` should have columns like:
Then we can loop over these:
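The join and the loop described above could be sketched roughly as follows. This is one possible reading of the approach, not necessarily the exact code of the answer: the `_red`/`_nir` suffixes, the toy data and the name `date_cols` are assumptions made for illustration.

```r
library(dplyr)

getNDVI <- function(red, nir) round((nir - red) / (nir + red), digits = 4)

# Small stand-ins for redDF and nirDF (illustrative values)
redDF <- data.frame(id = 1143:1145,
                    `X2015-05-01` = c(NA, 989L, 868L),
                    `X2015-05-17` = c(609L, NA, 600L),
                    check.names = FALSE)
nirDF <- redDF %>% mutate(across(-id, ~ as.integer(.x + 36)))

date_cols <- names(redDF)[startsWith(names(redDF), "X")]

# Join on id; the suffixes record which data set each column came from
joined_df <- inner_join(redDF, nirDF, by = "id", suffix = c("_red", "_nir"))
names(joined_df)
# id, X2015-05-01_red, X2015-05-17_red, X2015-05-01_nir, X2015-05-17_nir

# One pass per date: build the column names as strings, turn them into symbols
for (col in date_cols) {
  red_col <- rlang::sym(paste0(col, "_red"))
  nir_col <- rlang::sym(paste0(col, "_nir"))
  out_col <- rlang::sym(col)
  joined_df <- joined_df %>%
    mutate(!!out_col := getNDVI(!!red_col, !!nir_col))
}

# Keep only id and the newly computed columns
ndviDF <- joined_df %>% select(id, all_of(date_cols))
```

Joining on `id` rather than relying on row order means the two data frames do not have to be sorted the same way.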
Explanation: We can use text strings as variable names if we turn them into symbols and then `!!` them. `sym()` turns text into symbols, `!!` inside dplyr commands turns symbols into code, and `:=` is equivalent to `=` but permits us to have `!!` on the left-hand side.
Sorry, this is slightly old syntax. For the current approaches see programming with dplyr.
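For comparison, the same loop can probably be written without `sym()` and `!!`, using the `.data` pronoun and glue-style names on the left of `:=`. This is only a sketch and assumes a reasonably recent dplyr (1.0 or later is assumed here). The toy data and the `_red`/`_nir` suffixes are illustrative assumptions.

```r
library(dplyr)

getNDVI <- function(red, nir) round((nir - red) / (nir + red), digits = 4)

# Small stand-ins for redDF and nirDF (illustrative values)
redDF <- data.frame(id = 1143:1145,
                    `X2015-05-01` = c(NA, 989L, 868L),
                    `X2015-05-17` = c(609L, NA, 600L),
                    check.names = FALSE)
nirDF <- redDF
nirDF[-1] <- lapply(redDF[-1], function(x) as.integer(x + 36))

date_cols <- names(redDF)[startsWith(names(redDF), "X")]
joined_df <- inner_join(redDF, nirDF, by = "id", suffix = c("_red", "_nir"))

for (col in date_cols) {
  joined_df <- joined_df %>%
    # "{col}" names the output column after the string held in `col`;
    # .data[[...]] looks up an input column by its name as a string
    mutate("{col}" := getNDVI(.data[[paste0(col, "_red")]],
                              .data[[paste0(col, "_nir")]]))
}

ndviDF <- joined_df %>% select(id, all_of(date_cols))
```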
In its most basic form, you can just do this:
But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
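Both variants might look roughly like the sketch below. It is not necessarily the code of the original answer: the toy data, the checks and the name `date_cols` are assumptions for illustration. It relies on the fact that arithmetic operators and `round()` work element-wise on data frames whose columns are all numeric.

```r
getNDVI <- function(red, nir) round((nir - red) / (nir + red), digits = 4)

# Small stand-ins for redDF and nirDF (illustrative values)
redDF <- data.frame(id = 1143:1145,
                    `X2015-05-01` = c(NA, 989L, 868L),
                    `X2015-05-17` = c(609L, NA, 600L),
                    check.names = FALSE)
nirDF <- redDF
nirDF[-1] <- lapply(redDF[-1], function(x) as.integer(x + 36))

# Basic form: pass the whole data frames to the formula.
# The id column goes through the formula as well: (id - id) / (id + id) = 0
basic <- getNDVI(redDF, nirDF)

# More defensive form: check that both data frames line up, restrict the
# calculation to the numeric date columns, then put id back in front
stopifnot(identical(names(redDF), names(nirDF)),
          identical(redDF$id, nirDF$id))
date_cols <- names(redDF)[startsWith(names(redDF), "X")]
stopifnot(all(vapply(redDF[date_cols], is.numeric, logical(1))),
          all(vapply(nirDF[date_cols], is.numeric, logical(1))))

ndviDF <- cbind(redDF["id"], getNDVI(redDF[date_cols], nirDF[date_cols]))
```

Because data frame arithmetic pairs cells by position, the checks on names and ids are there to catch data frames whose rows or columns are ordered differently.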
As far as I have understood `dplyr` by now, it boils down to this:
`across` is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
`c_across`, on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.
Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try `cars*cars` for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames.
While doing so, you need to take some requirements into account:
`numeric` or `integer`).
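To make the distinction between `across`, `c_across` and plain data frame arithmetic concrete, here is a small sketch on a made-up three-column data frame. The names `df`, `x`, `y`, `z` and `total` are purely illustrative.

```r
library(dplyr)

df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

# across(): three columns in, three columns out, each handled on its own
out_across <- df %>% mutate(across(x:z, ~ .x * 2))

# c_across(): three columns in, one new column out (computed per row)
out_c_across <- df %>%
  rowwise() %>%
  mutate(total = sum(c_across(x:z))) %>%
  ungroup()

# Data frame arithmetic: cells are paired by position, here each with itself
squared <- df * df
```

Here `out_across` keeps the three original columns (doubled), `out_c_across` gains a single `total` column, and `squared` has the same shape as `df` with every cell multiplied by itself.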