R: Applying a custom function row by row using mutate()

Posted 2025-02-12 06:46:10

I have created a function that uses st_join() from the sf package to extract the congressional district (a polygon) from a set of latitude and longitude coordinates, using a different shapefile to identify the congressional district depending on a "congress" argument that is specified. (This is necessary because districts are periodically redrawn, so the boundaries change over time.) The next step is to apply the function row by row to a data frame containing multiple rows of coordinates (and associated "congress" values) so that the congress value for a given row determines which shapefile to use, and then assign the extracted district to a new variable.

I'm running into trouble applying this function row by row. I first tried using the rowwise() and mutate() functions from dplyr, but got a "must be size 1" error. Based on the comments to this question, I put list() around the value assigned inside mutate(), but this has resulted in the new variable being a list instead of a single character string.

I would greatly appreciate help figuring out a way to either (i) modify the function so that it can be applied row by row using rowwise() and mutate() or (ii) apply my function row-by-row in some other way.

Reproducible code is below; you just need to download two shapefiles from https://cdmaps.polisci.ucla.edu/ ("districts104.zip" and "districts111.zip"), unzip them, and put them in your working directory.

library(tidyverse)
library(sf)

districts_104 <- st_read("districts104.shp")
districts_111 <- st_read("districts111.shp")

congress <- c(104, 111)
latitude <- c(37.32935, 37.32935)
longitude <- c(-122.00954, -122.00954)
df_test <- data.frame(congress, latitude, longitude)

point_geo_test <- st_as_sf(df_test,
                             coords = c(x = "longitude", y = "latitude"),
                             crs = st_crs(districts_104)) # prep for st_join()

sf_use_s2(FALSE) # preempt evaluation error that would otherwise pop up when using the st_join function

extract_district <- function(points, cong) {
  shapefile <- get(paste0("districts_", cong))
  st_join_results <- st_join(points, shapefile, join = st_within)
  paste(st_join_results$STATENAME, st_join_results$DISTRICT, sep = "-")
}

point_geo_test <- point_geo_test %>%
  rowwise %>%
  mutate(district = list(extract_district(points = point_geo_test, cong = congress)))

吾性傲以野 2025-02-19 06:46:12

Edit 7 July:

From your comments I understand you were looking for something different; the assumption I made about why your function was giving multiple values was wrong. Hence this new answer from scratch:

The custom function you've written doesn't lend itself to row-by-row application, because it already processes all rows at once:

Given the following input:

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test contains these values:

> point_geo_test
[...]
  congress                   geometry
1      104 POINT (-122.0095 37.32935)
2      111 POINT (-122.0095 37.32935)
3      104   POINT (73.72036 41.1134)
4      111   POINT (73.72036 41.1134)
5      104 POINT (-87.86885 42.15549)
6      111 POINT (-87.86885 42.15549)

and extract_district() returns this:

> extract_district(point_geo_test, 104)
[...]
[1] "California-14" "California-14" "NA-NA"         "NA-NA"         "Illinois-10"   "Illinois-10"

This is already a result for each row. The only problem is that, while these are correct results for each row's coordinates, they give the name for those coordinates only during congress 104. Hence, these values are valid only for the rows in point_geo_test where congress == 104.
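
The behaviour described above can be reproduced without the real shapefiles. The following is a minimal sketch under stated assumptions: `make_square()`, `toy_104` and `pts` are hypothetical stand-ins invented for illustration (two square "districts" and three points with no CRS), not the actual district data. It only shows that st_join() returns one row per input point, so a function built on it is already vectorised over rows.

```r
library(sf)

# Hypothetical helper: build a square polygon from its corner coordinates
make_square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Toy stand-in for districts_104 (NOT the real boundaries)
toy_104 <- st_sf(
  STATENAME = c("A", "B"),
  DISTRICT  = c("1", "2"),
  geometry  = st_sfc(make_square(0, 0, 1, 1), make_square(1, 0, 2, 1))
)

# Three toy points: two inside the first square, one outside every polygon
pts <- st_sf(
  congress = c(104, 111, 104),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(0.5, 0.5)), st_point(c(5, 5)))
)

# st_join() is a left join by default, so every point is kept
joined <- st_join(pts, toy_104, join = st_within)
labels <- paste(joined$STATENAME, joined$DISTRICT, sep = "-")

nrow(joined)  # 3: one result row per input point
labels        # "A-1" "A-1" "NA-NA"
```

Note that the second point gets the congress-104 label even though its `congress` value is 111, because the join never looks at that column. Passing the whole table to such a function inside rowwise() therefore yields a full-length vector for every row, which is consistent with the "must be size 1" error in the question.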

Extracting correct values for all rows

We will create a function that returns the correct data for all rows, e.g. the correct name for the coordinates during the associated congress.

I've simplified your code slightly: the df_test is not an intermediate data frame any more, but defined directly in the creation of point_geo_test. Any values I extract, I'll save into this data frame as well.

library(tidyverse)
library(sf)
sf_use_s2(FALSE)

districts_104 <- st_read("districts104.shp")
districts_111 <- st_read("districts111.shp")

congress <- c(104, 111, 104, 111, 104, 111)
latitude <- c(37.32935, 37.32935, 41.1134016, 41.1134016, 42.1554948, 42.1554948)
longitude <- c(-122.00954, -122.00954, 73.720356, 73.720356, -87.868850502543, -87.868850502543)

point_geo_test <- st_as_sf(data.frame(congress, latitude, longitude),
                           coords = c(x = "longitude", y = "latitude"),
                           crs = st_crs(districts_104))

To keep the code more flexible and organized, I'll create a generic function that can fetch any parameter for the given coordinates:

extract_values <- function(points, parameter) {
  # initialize return values, one for each row in `points`
  values <- rep(NA, nrow(points))
  
  # for each congress present in `points`, lookup parameter and store in the rows with matching congress
  for(cong in unique(points$congress)) {
    shapefile <- get(paste0("districts_", cong))
    st_join_results <- st_join(points, shapefile, join = st_within)
    values[points$congress == cong] <- st_join_results[[parameter]][points$congress == cong]
  }
  
  return(values)
}

Examples:

> extract_values(point_geo_test, 'STATENAME')
[1] "California" "California" NA           NA           "Illinois"   "Illinois"  
> extract_values(point_geo_test, 'DISTRICT')
[1] "14" "15" NA   NA   "10" "10"

Storing values

point_geo_test$state <- extract_values(point_geo_test, 'STATENAME')
point_geo_test$district <- extract_values(point_geo_test, 'DISTRICT')
point_geo_test$name <- paste(point_geo_test$state, point_geo_test$district, sep = "-")

Result:

> point_geo_test
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -122.0095 ymin: 37.32935 xmax: 73.72036 ymax: 42.15549
Geodetic CRS:  GRS 1980(IUGG, 1980)
  congress      state district          name                   geometry
1      104 California       14 California-14 POINT (-122.0095 37.32935)
2      111 California       15 California-15 POINT (-122.0095 37.32935)
3      104       <NA>     <NA>         NA-NA   POINT (73.72036 41.1134)
4      111       <NA>     <NA>         NA-NA   POINT (73.72036 41.1134)
5      104   Illinois       10   Illinois-10 POINT (-87.86885 42.15549)
6      111   Illinois       10   Illinois-10 POINT (-87.86885 42.15549)
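
As a possible variation on the loop above, the layers could be kept in a named list rather than looked up with get(), and each layer could be joined only against the rows that belong to it. This is a sketch, not a tested replacement: `extract_by_congress()`, `district_layers`, `make_square()` and the toy layers are hypothetical names and data invented for illustration, and the code assumes that each point falls within at most one polygon per layer (otherwise st_join() would return extra rows).

```r
library(sf)

# Hypothetical helper: build a square polygon from its corner coordinates
make_square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Toy layers standing in for districts_104 and districts_111 (NOT real data):
# same shapes, different district numbers, to mimic redistricting
squares <- st_sfc(make_square(0, 0, 1, 1), make_square(1, 0, 2, 1))
toy_104 <- st_sf(STATENAME = c("A", "B"), DISTRICT = c("1", "2"), geometry = squares)
toy_111 <- st_sf(STATENAME = c("A", "B"), DISTRICT = c("7", "8"), geometry = squares)

# Named list keyed by congress number, used instead of get(paste0(...))
district_layers <- list("104" = toy_104, "111" = toy_111)

extract_by_congress <- function(points, layers, parameter) {
  # one return value per row of `points`
  values <- rep(NA_character_, nrow(points))
  for (cong in unique(points$congress)) {
    rows <- which(points$congress == cong)
    # join only the rows of this congress against the matching layer
    joined <- st_join(points[rows, ], layers[[as.character(cong)]], join = st_within)
    values[rows] <- as.character(joined[[parameter]])
  }
  values
}

pts <- st_sf(
  congress = c(104, 111, 104),
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(0.5, 0.5)), st_point(c(5, 5)))
)

result <- extract_by_congress(pts, district_layers, "DISTRICT")
result  # "1" "7" NA
```

With the real data, the list would hold `districts_104` and `districts_111` under the same keys. Joining only the matching rows avoids computing results that are then discarded, which may matter when there are many points and many congresses.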
