Apply a function on a data.frame with mutate, using the same columns from another data.frame

Posted on 2025-01-16 06:11:25 · 2326 words · 0 views · 0 comments

I have two data frames with spectral bands from a satellite, redDF and nirDF. Both hold one column per date, with column names that start with 'X', and these names match across the two data frames.
I want a new data frame in which, for each column starting with 'X' in redDF and nirDF, a new value is calculated according to a formula.

Here is a data sample:

library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
           as.Date("2015-09-20", "%Y-%m-%d"),
           by="16 days") %>% 
  format(., "%Y-%m-%d") %>% 
  paste0("X", .)
# sample data values 
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices

# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
           mydata[sample(1:length(mydata))],
           mydata[sample(1:length(mydata))]) %>% 
  t() %>% 
  as.data.frame(., row.names = FALSE) %>% 
  rename_with(~datecolnames) %>% 
  mutate(id = row_number()+1142) %>% 
  select(id, everything())

# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>% 
  mutate(across(-id,~as.integer(.x+20*1.8))) %>% 
  select(id, everything())

> nirDF
    id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143          NA         645          NA         636         569         841         706
2 1144        1025          NA         706         569         354          NA          NA
3 1145         904         636         706         645          NA          NA         115
  X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1         115        1025         904          NA         409         354
2         115         636         409         645         841         904
3         569         409         354         841        1025          NA

and this is the formula:

getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)} 

I hoped I would be able to do something like:

ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))

But that doesn't work, as dplyr doesn't know what the nir argument of getNDVI should be. I have seen solutions for accessing other data frames in mutate() by using the $COLNAME indexer, but since I have 197 columns, that is not an option here.

Comments (2)

很糊涂小朋友 2025-01-23 06:11:25

I would approach this with a for loop, though I know it does not make best use of functionality like across.

First we create a list of the columns we want to iterate over:

cols_to_iterate_over = redDF %>%
  select(starts_with("X")) %>%
  colnames()

Then we join on id and ensure columns are named according to source dataset:

joined_df = redDF %>%
  inner_join(nirDF, by = "id", suffix = c("_red","_nir"))

So joined_df should have columns like the following (all _red columns first, then the _nir columns):

id X2015-05-01_red X2015-05-17_red ... X2015-05-01_nir X2015-05-17_nir ...

Then we can loop over these:

for(col in cols_to_iterate_over){
  # columns for calculation
  red_col = paste0(col,"_red") %>% sym()
  nir_col = paste0(col,"_nir") %>% sym()
  out_col = col %>% sym()
  
  # calculate
  joined_df = joined_df %>%
    mutate(
      !!out_col := round((!!nir_col - !!red_col)/(!!nir_col + !!red_col),
                         digits = 4)
    ) %>%
    select(-!!red_col, -!!nir_col)
}

Explanation: We can use text strings as variable names if we turn them into symbols and then !! them.

  • sym() turns text into symbols,
  • !! inside dplyr commands turns symbols into code,
  • and := is equivalent to = but permits us to have !! on the left-hand side.

Sorry, this is slightly old syntax. For the current approaches, see the "Programming with dplyr" vignette.
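
For comparison, here is a rough sketch of the same loop written with the idioms that vignette describes: the .data pronoun for column names held in strings, and glue-style "{col}" := for the output name. The small red and nir frames are made-up stand-ins for redDF and nirDF, and the sketch assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy stand-ins for redDF and nirDF (made-up values, not the question's data)
red <- data.frame(id = 1:2, `X2015-05-01` = c(100, 200),
                  `X2015-05-17` = c(150, NA), check.names = FALSE)
nir <- data.frame(id = 1:2, `X2015-05-01` = c(300, 600),
                  `X2015-05-17` = c(450, 500), check.names = FALSE)

# Names of the date columns shared by both frames
date_cols <- red %>% select(starts_with("X")) %>% colnames()

# Join on id; suffixes mark which frame each column came from
joined <- inner_join(red, nir, by = "id", suffix = c("_red", "_nir"))

for (col in date_cols) {
  red_col <- paste0(col, "_red")
  nir_col <- paste0(col, "_nir")
  joined <- joined %>%
    # .data[[...]] looks a column up by its string name
    mutate("{col}" := round(
      (.data[[nir_col]] - .data[[red_col]]) /
        (.data[[nir_col]] + .data[[red_col]]),
      digits = 4
    )) %>%
    # drop the two input columns once the result is stored
    select(-all_of(c(red_col, nir_col)))
}
```

After the loop, joined holds id plus one NDVI column per date; an NA in either band carries through as NA.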

烟─花易冷 2025-01-23 06:11:25

In its most basic form, you can just do this:

round((nirDF - redDF)/(nirDF + redDF), digits = 4)

But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
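
To see the id problem concretely, here is a minimal sketch on two made-up two-row frames (the values and the column name X1 are illustrative only). Because the arithmetic is applied to every column, id is run through the formula as well.

```r
# Toy frames with made-up values; X1 stands in for one date column
red <- data.frame(id = c(1143, 1144), X1 = c(100, 200))
nir <- data.frame(id = c(1143, 1144), X1 = c(300, 600))

# Element-wise arithmetic over whole data frames, id column included
ndvi <- round((nir - red) / (nir + red), digits = 4)

# id becomes (id - id) / (id + id), so the original ids are lost
print(ndvi)
```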

red <- redDF %>% 
  arrange(id) %>%  # be sure to apply the same order everywhere
  select(starts_with('X')) %>%  
  mutate(across(everything(), as.numeric)) # be sure to have numeric columns 
nir <- nirDF %>% arrange(id) %>% 
  select(starts_with('X')) %>%  
  mutate(across(everything(), as.numeric))

# make sure that the number of rows are equal
if(nrow(red) == nrow(nir)){
  ndvi <- redDF %>% 
    # get data.frame with ndvi values
    transmute(round((nir - red)/(nir + red), digits = 4)) %>% 
    # bind id-column and possibly other columns to the data frame
    bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>% 
    # place the id-column to the front
    select(!starts_with('X'), everything())
}
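
As a variation on the same idea, the rows can be aligned by id with match() and the arithmetic restricted to the 'X' columns, which keeps id untouched. This is only a sketch on made-up data: the frames, the column names X1 and X2, and the helper name nir_aligned are assumptions for illustration, and it assumes every id in red also occurs in nir.

```r
# Toy frames with made-up values; note the rows are in a different order
red <- data.frame(id = c(1144, 1143), X1 = c(100, 200), X2 = c(50, 80))
nir <- data.frame(id = c(1143, 1144), X1 = c(200, 300), X2 = c(80, 150))

# Columns that take part in the calculation
x_cols <- grep("^X", names(red), value = TRUE)

# Reorder nir so that row i refers to the same id as row i of red
nir_aligned <- nir[match(red$id, nir$id), , drop = FALSE]
rownames(nir_aligned) <- NULL

# Data-frame arithmetic on the X columns only
ndvi_values <- round(
  (nir_aligned[x_cols] - red[x_cols]) / (nir_aligned[x_cols] + red[x_cols]),
  digits = 4
)

# Put the untouched id column back in front
ndvi <- cbind(red["id"], ndvi_values)
```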

As far as I have understood dplyr by now, it boils down to this:

  • across is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
  • c_across on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.
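
The difference between the two can be illustrated with a small sketch; the data frame df and its columns x, y and z are made up for the example.

```r
library(dplyr)

df <- data.frame(x = 1:2, y = 3:4, z = 5:6)

# across(): three columns in, three columns out, each handled on its own
per_column <- df %>%
  mutate(across(everything(), ~ .x * 2))

# c_across(): three columns in, one value per row out (needs rowwise())
per_row <- df %>%
  rowwise() %>%
  mutate(total = sum(c_across(x:z))) %>%
  ungroup()
```

Here per_column still has the columns x, y and z, while per_row gains a single total column that combines them.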

Neither of these is suitable for this task. However, by design, arithmetic operations can be applied directly to data frames in R (just try cars*cars for instance), and that is what we need in this case. Luckily, these operations are less resource-hungry than dplyr join operations, so they can be done efficiently on large data frames.
While doing so, you need to take some requirements into account:

  • The two data frames must have the same dimensions; arithmetic between data frames of different sizes stops with an error rather than recycling the shorter one.
  • All columns in both data frames need to be of a numeric class (numeric or integer).
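
These requirements can be checked before doing the arithmetic. The helper below is a hypothetical sketch (can_combine is not part of dplyr or base R); it also compares column names, since data-frame arithmetic pairs columns by position.

```r
# Hypothetical helper: TRUE only if element-wise arithmetic on a and b is safe
can_combine <- function(a, b) {
  identical(dim(a), dim(b)) &&                 # same number of rows and columns
    identical(names(a), names(b)) &&           # columns line up by position
    all(vapply(a, is.numeric, logical(1))) &&  # numeric or integer only
    all(vapply(b, is.numeric, logical(1)))
}

# Made-up frames to exercise the check
red <- data.frame(X1 = c(100, 200), X2 = c(50, 80))
nir <- data.frame(X1 = c(300, 600), X2 = c(150, 80))

ok <- can_combine(red, nir)                          # both frames match
too_short <- can_combine(red, nir[1, , drop = FALSE]) # row counts differ
not_numeric <- can_combine(red, data.frame(X1 = c("a", "b"), X2 = c(1, 2)))
```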