Average points together without repeating any point and reduce the final dataframe



The goal is to average points together within 10 meters without repeating any points in the averaging, reduce the point dataframe to the averaged points, and to ideally obtain a smooth flow of points along the routes said points were collected. Here is an 11 point subset example dataframe from a much larger file (25,000 observations):

library(sf)
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
                 ) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs)

Here is what I've tried:

  1. Many iterations of st_is_within_distance(trav, trav, tolerance), including an aggregate method shown here. These don't work because the same points get averaged multiple times.
  2. Got close with filter and across by trying to dynamically update a list in lapply, but it didn't work out in the end.
  3. This answer from @jeffreyevans is helpful, but doesn't really solve the problem and is a bit outdated.
  4. The spThin package doesn't work because it's made for more specific variables.
  5. I thought to cluster using this post, but the clusters throw random points and don't actually reduce the dataframe efficiently (a sketch of that kind of clustering is shown after this list).
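For reference, here is a minimal sketch of that kind of distance-based clustering (hypothetical code, not the code from that post; it cuts a hierarchical clustering of the pairwise distance matrix at 10 m and then averages each cluster):

library(dplyr)

# pairwise distances in metres (st_distance() returns a units matrix)
d <- units::drop_units(st_distance(df))

# complete-linkage clusters cut at 10 m
df$cluster <- cutree(hclust(as.dist(d), method = "complete"), h = 10)

# one averaged point per cluster: mean trait, centroid of the member points
df_clustered <- df %>%
  group_by(cluster) %>%
  summarise(trait = mean(trait), datetime = first(datetime)) %>%
  st_centroid()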

Here is as close as I've gotten. Again, the issue with this solution is it repeats points in collecting averages, which gives more weight to certain points than others.

  # dplyr is needed below for the pipes, summarise(), and bind_rows()
  library(dplyr)

  # first set tolerance
  tolerance <- 20 # 20 meters
  
  # get distance between points
  i <- st_is_within_distance(df, df, tolerance)
  
  # filter for indices with more than 1 (self) neighbor
  i <- i[which(lengths(i) > 1)]   
 
  # filter for unique indices (point 1, 2 / point 2, 1)
  i <- i[!duplicated(i)]
    
  # points in `sf` object that have no neighbors within tolerance
  no_neighbors <- df[!(1:nrow(df) %in% unlist(i)),  ]  

  # iterate over indices of neighboring points
  avg_points <- lapply(i, function(b){
    df <- df[unlist(b), ]
    coords <- st_coordinates(df)
    
    df <- df %>%
      st_drop_geometry() %>%
      cbind(., coords)
    
    df_sum <-  df %>%
      summarise(
        datetime = first(datetime),
        trait = mean(trait),
        X = mean(X),
        Y = mean(Y),
        .groups = 'drop') %>% 
        ungroup()

    return(df_sum)
  }) %>% 
    
    bind_rows() %>% 
    st_as_sf(coords = c('X', 'Y'),
             crs = "+proj=longlat +datum=WGS84 +no_defs ") 

Comments (4)

调妓 2025-01-23 04:28:50


Another answer, using sf::aggregate() and a hexagonal grid to find points that are within a particular distance from each other. A square grid could be used as well. Results will vary some depending on where exactly the grid falls in relation to the points, but no point should be used more than once in determining the mean.

Outline of the steps:

  • load data, transform to crs 5070 for measurements in meters
  • get a bounding box of the data
  • make a hexagonal grid over the bounding box, with cells roughly 10 m across
  • aggregate points falling in the same hexagon using mean
  • join to original data
library(sf)
library(tidyverse)

set.seed(22) # might be needed to get same hex grid?

#### your sample data
df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs) %>%
  st_transform(5070)  ### transform to 5070 for a projection in meters
#### end sample data


# Get a bounding box as an sf object to make a grid
bbox <- st_bbox(df) %>% st_as_sfc()

# Make a grid as hexagons with approximately the right size 
#  area ~86 m^2; side ~5.75 m; long diagonal ~11.5 m 
hex_grid <- st_make_grid(bbox, cellsize = 10, square = F) %>% st_as_sf()
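
# Optional sanity check of the cell size quoted above (a sketch; crs 5070 is in
# metres, so st_area() reports square metres): one cell should be ~86.6 m^2
st_area(hex_grid[1, ])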

# Aggregate mean of the hexagonal grid
hex_agg <- aggregate(df ,
                     hex_grid,
                     mean,
                     join = st_contains) %>% filter(!is.na(trait))

# Assign the mean of the hexagon to points that fall 
#  within each hexagon
df_agg <- st_join(df, hex_agg)

head(df_agg) # trait.x from df, trait.y from the mean by hexagon
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 121281.6 ymin: 1786179 xmax: 121285.3 ymax: 1786227
#> Projected CRS: NAD83 / Conus Albers
#>   trait.x          datetime.x  trait.y          datetime.y
#> 1   91.22 2021-08-06 15:08:43 91.70500 2021-08-06 15:26:00
#> 2   91.22 2021-08-06 15:08:44 91.32667 2021-08-06 15:31:47
#> 3   91.22 2021-08-06 15:08:46 91.22000 2021-08-06 15:08:46
#> 4   91.58 2021-08-06 15:08:47 91.58000 2021-08-06 15:08:47
#> 5   91.47 2021-08-06 15:43:17 91.47000 2021-08-06 15:43:17
#> 6   92.19 2021-08-06 15:43:18 91.70500 2021-08-06 15:26:00
#>                   geometry
#> 1 POINT (121282.5 1786184)
#> 2 POINT (121283.2 1786194)
#> 3 POINT (121284.6 1786216)
#> 4 POINT (121285.3 1786227)
#> 5 POINT (121281.7 1786179)
#> 6 POINT (121281.6 1786186)

sum(df_agg$trait.x) - sum(df_agg$trait.y) # original trait - aggregate trait should be 0, or near 0
#> [1] 0

ggplot(df_agg) + 
  geom_sf(aes(size = trait.x), alpha = .2, color = 'blue') +  # Original trait
  geom_sf(aes(size = trait.y), alpha = .2, color = 'red') +  # New aggregated trait
  theme_void()

Sized by trait. Blue points are original, red is the new spatial mean.


## Plot of
# original points & hex grid used:
ggplot() + 
  geom_sf(data = df, color = 'red') + 
  geom_sf(data = hex_grid, fill = NA) +
  theme_void()

Plot showing the grouping of the points for the mean. Looks like there were groups of 1, 2, and 3 points per hexagon for the mean.

Created on 2022-03-23 by the reprex package (v2.0.1)

Edit

Updated to have only one point per hexagon, losing some of the original points

## Edit for one point per hexagon:
hex_agg <- aggregate(df ,
                     hex_grid,
                     mean,
                     join = st_contains) %>% filter(!is.na(trait)) %>%
  rownames_to_column('hex_num')  # add hexagon number to group_by

## Re-join so that df_agg carries the hexagon number added above
df_agg <- st_join(df, hex_agg)

## Guide to join on, has only hexagon number & centroid of contained points
hex_guide <- df_agg %>%
  group_by(hex_num) %>%
  summarise() %>%
  st_centroid()

# The full sf object with only one point per hexagon
#  this join isn't the most efficient, but slice(1) removes
#  the duplicate data. You could clean df_agg before the join
#  to resolve this
final_join <- df_agg %>%
                 st_drop_geometry() %>%
                 left_join(hex_guide, by = 'hex_num') %>% 
                 group_by(hex_num) %>%
                 slice(1) %>%
                 ungroup() %>%
                 st_as_sf()

ggplot() +
  geom_sf(data = final_join, color = 'red', size = 3) +
  geom_sf(data = df, color = 'black', alpha = .5) +
  geom_sf(data = hex_grid, color = 'blue', fill = NA)

The plot shows the hexagons, original data points in grey, and new red points at the centroid of grouped original points. Only 1 red point per hexagon.


∞梦里开花 2025-01-23 04:28:50


I'm not sure, but perhaps this is what you are looking for?

You can experiment with the different settings/methods of smoothr::smooth() to get the desired results.

library(tidyverse)
library(igraph)
library(smoothr)
library(mapview)  # for viewing purposes only
# get a matrix of points <10 meter apart
m <- st_is_within_distance(df, dist = 10, sparse = FALSE)
# create an igraph object from the adjacency matrix
g <- graph.adjacency(m, mode="undirected", diag = FALSE)
plot(g)

The plot of the graph shows which points are within 10 metres of each other.

# pass cluster-number to df object
df$id <- as.vector(components(g)$membership)
# create polylines (only if more than 1 point!)
df.lines <- df %>%
  group_by(id) %>%
  dplyr::add_tally() %>%
  dplyr::filter(n > 1) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("LINESTRING") %>%
  # create smooth lines
  smoothr::smooth(method = "ksmooth")
#view points and lines
mapview::mapview(list(df, df.lines))
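For example, here is a quick sketch comparing a few of the other methods on the un-smoothed lines (the smoothness/refinements values are arbitrary starting points, not recommendations):

# rebuild the lines without smoothing, then try several smoothers side by side
df.raw <- df %>%
  group_by(id) %>%
  dplyr::add_tally() %>%
  dplyr::filter(n > 1) %>%
  summarise(geometry = st_combine(geometry)) %>%
  st_cast("LINESTRING")

mapview::mapview(list(
  smoothr::smooth(df.raw, method = "ksmooth", smoothness = 2),
  smoothr::smooth(df.raw, method = "chaikin", refinements = 4),
  smoothr::smooth(df.raw, method = "spline")
))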


[浮城] 2025-01-23 04:28:50


If I understand your problem correctly, it all boils down to selecting the "right" neighbors, i.e. those within a certain neighborhood which have not been used yet. If there is no such neighbor, simply use the point itself (even if it was already used in the averaging for another point).

Here's a solution using purrr::accumulate to first produce the correct indices and then simply use these indices to do the averaging:

library(purrr)
library(dplyr)

idx <- accumulate(i[-1L], function(x, y) {
   x$points <- setdiff(y, x$used)
   x$used <- union(x$used, y)
   x
}, .init = list(used = i[[1L]], points = i[[1L]]))

idx[1:4]

# [[1]]
# [[1]]$used
# [1] 1 2 5 6 7 8 9
# 
# [[1]]$points
# [1] 1 2 5 6 7 8 9
#
# 
# [[2]]
# [[2]]$used
# [1]  1  2  5  6  7  8  9 10 11
# 
# [[2]]$points
# [1] 10 11
#
#
# [[3]]
# [[3]]$used
# [1]  1  2  5  6  7  8  9 10 11  3  4
# 
# [[3]]$points
# [1] 3 4
#
#
# [[4]]
# [[4]]$used
# [1]  1  2  5  6  7  8  9 10 11  3  4
#
# [[4]]$points
# integer(0)

The idea is that we maintain a list of used indices (used), that is, the ones which were already used in any of the neighborhoods, together with the remaining indices (points). For instance, for the first point we use the points at indices 1, 2, 5, 6, 7, 8, 9, which leaves only indices 10, 11 for the second point. If there are no points left, we return integer(0).

Now that we have set up the indices list, the rest is easy: loop through the list, select the indicated points (using the point itself in case there are none left) and do the averaging:

idx %>% 
   imap_dfr(function(x, y) {
      if (!length(x$points)) {
         idx <- y
      } else {
         idx <- x$points
      }
      df[idx, , drop = FALSE] %>% 
         bind_cols(st_coordinates(.) %>% as_tibble()) %>% 
         st_drop_geometry() %>% 
         summarise(datetime = first(datetime),
                   trait = mean(trait),
                   X = mean(X),
                   Y = mean(Y))
   }) %>% 
   st_as_sf(coords = c('X', 'Y'),
            crs = "+proj=longlat +datum=WGS84 +no_defs ") 

# Simple feature collection with 11 features and 2 fields
# Geometry type: POINT
# Dimension:     XY
# Bounding box:  xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
# CRS:           +proj=longlat +datum=WGS84 +no_defs 
# First 10 features:
#               datetime    trait                   geometry
# 1  2021-08-06 15:08:43 91.34714 POINT (-94.58464 39.09259)
# 2  2021-08-06 15:43:22 91.65000 POINT (-94.58473 39.09274)
# 3  2021-08-06 15:08:46 91.40000  POINT (-94.5846 39.09286)
# 4  2021-08-06 15:08:47 91.58000 POINT (-94.58459 39.09291)
# 5  2021-08-06 15:43:17 91.47000 POINT (-94.58464 39.09248)
# 6  2021-08-06 15:43:18 92.19000 POINT (-94.58464 39.09255)
# 7  2021-08-06 15:43:19 92.19000 POINT (-94.58464 39.09261)
# 8  2021-08-06 15:43:20 90.57000 POINT (-94.58464 39.09266)
# 9  2021-08-06 15:43:21 90.57000  POINT (-94.58466 39.0927)
# 10 2021-08-06 15:43:22 91.65000  POINT (-94.5847 39.09273)
情定在深秋 2025-01-23 04:28:50


If the goal is to not weight any point more than any other point in the cluster averages, it would be more balanced to use weighted averages rather than trying to force each cluster to contain a set of points unique from all other clusters.

One way to think of the below methodology is to "chop up" each observation and divvy up the pieces into clusters in such a way that the weight of the pieces in each cluster sums to 1.

This will probably be too expensive for 25k observations, so one option could be to perform this on overlapping or non-overlapping segments and stitch them together.

library(sf)
library(Rfast) # for the 'eachrow' function

df <- data.frame(trait = as.numeric(c(91.22,91.22,91.22,91.58,91.47,92.19,92.19,90.57,90.57,91.65,91.65)), 
                 datetime = as.POSIXct(c("2021-08-06 15:08:43","2021-08-06 15:08:44","2021-08-06 15:08:46","2021-08-06 15:08:47","2021-08-06 15:43:17","2021-08-06 15:43:18","2021-08-06 15:43:19","2021-08-06 15:43:20","2021-08-06 15:43:21","2021-08-06 15:43:22","2021-08-06 15:43:23")),
                 lat = c(39.09253, 39.09262, 39.09281, 39.09291, 39.09248, 39.09255, 39.09261, 39.09266, 39.0927, 39.09273, 39.09274),
                 lon = c(-94.58463, -94.58462, -94.5846, -94.58459, -94.58464, -94.58464, -94.58464, -94.58464, -94.58466, -94.5847, -94.58476)
) # just to add some value that is plottable
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
df <- st_as_sf(x = df,                         
               coords = c("lon", "lat"),
               crs = projcrs)
n <- nrow(df)
# sum the trait column for a sanity check after calculations
sumtrait <- sum(df$trait)

# first set tolerance
tolerance <- 20 # 20 meters
tol <- 1e-5 # tolerance for the weight matrix marginal sums
# create clusters of points grouped by circles centered at each point
i <- st_is_within_distance(df, df, tolerance)
# Initialize a matrix for the weight of each point within each cluster. The
# initial value represents an unweighted average for each cluster, so the row
# sums are not necessarily 1.
sz <- lengths(i)
w <- replace(matrix(0, n, n), unlist(sapply(1:n, function(x) i[[x]] + n*(x - 1))), rep.int(1/sz, sz))
# iteratively adjust the weights until the marginal sums all equal 1 (within
# tolerance)
marg <- rowSums(w)

while (max(abs(marg - 1)) > tol) {
  w <- w/marg
  marg <- colSums(w)
  w <- eachrow(w, marg, "/")
  marg <- rowSums(w)
}
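
# Sanity check (a sketch): after convergence each column (one cluster's weights)
# sums to exactly 1 and each row (one observation's total weight across all
# clusters) sums to 1 within `tol`, so no observation counts more than any other
range(colSums(w))  # ~1
range(rowSums(w))  # ~1, within tol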

df$trait <- colSums(w*df$trait)
print(df, n = nrow(df))
#> Simple feature collection with 11 features and 2 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -94.58476 ymin: 39.09248 xmax: -94.58459 ymax: 39.09291
#> CRS:           +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
#>       trait            datetime                   geometry
#> 1  91.37719 2021-08-06 15:08:43 POINT (-94.58463 39.09253)
#> 2  91.44430 2021-08-06 15:08:44 POINT (-94.58462 39.09262)
#> 3  91.31374 2021-08-06 15:08:46  POINT (-94.5846 39.09281)
#> 4  91.46755 2021-08-06 15:08:47 POINT (-94.58459 39.09291)
#> 5  91.64053 2021-08-06 15:43:17 POINT (-94.58464 39.09248)
#> 6  91.37719 2021-08-06 15:43:18 POINT (-94.58464 39.09255)
#> 7  91.44430 2021-08-06 15:43:19 POINT (-94.58464 39.09261)
#> 8  91.41380 2021-08-06 15:43:20 POINT (-94.58464 39.09266)
#> 9  91.41380 2021-08-06 15:43:21  POINT (-94.58466 39.0927)
#> 10 91.31880 2021-08-06 15:43:22  POINT (-94.5847 39.09273)
#> 11 91.31880 2021-08-06 15:43:23 POINT (-94.58476 39.09274)

# check that the sum of the "traits" column is unchanged
sum(df$trait) - sumtrait
#> [1] 4.875536e-07
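
As a rough sketch of the segmenting idea mentioned above (hypothetical, meant for the full 25k-point dataset): split the rows into chunks and run the same weighting per chunk. This assumes points that are consecutive in the data are also near each other along the route, and points near a chunk boundary lose their neighbours in the adjacent chunk, which overlapping segments would mitigate.

chunk_size <- 2000
chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / chunk_size))

df_weighted <- do.call(rbind, lapply(chunks, function(rows) {
  dfc <- df[rows, ]
  nc <- nrow(dfc)
  ic <- st_is_within_distance(dfc, dfc, tolerance)
  szc <- lengths(ic)
  wc <- replace(matrix(0, nc, nc),
                unlist(sapply(seq_len(nc), function(x) ic[[x]] + nc*(x - 1))),
                rep.int(1/szc, szc))
  marg <- rowSums(wc)
  while (max(abs(marg - 1)) > tol) {
    wc <- wc/marg
    marg <- colSums(wc)
    wc <- eachrow(wc, marg, "/")
    marg <- rowSums(wc)
  }
  dfc$trait <- colSums(wc * dfc$trait)
  dfc
}))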

UPDATE: If an exclusive grouping method is really needed, this implements a greedy algorithm:

avg_points <- numeric(nrow(df))
clusters <- vector("list", nrow(df))
currclust <- 0L
df$unused <- TRUE
for (cl in seq_len(nrow(df))) {
  if (any(df$unused[i[[cl]]])) {
    currclust <- currclust + 1L
    avg_points[currclust] <- mean(df$trait[i[[cl]]][df$unused[i[[cl]]]])
    clusters[[currclust]] <- i[[cl]][df$unused[i[[cl]]]]
    df$unused[i[[cl]]] <- FALSE
  }
}

avg_points <- avg_points[1:currclust]
clusters <- clusters[1:currclust]
avg_points
#> [1] 91.34714 91.65000 91.40000
clusters
#> [[1]]
#> [1] 1 2 5 6 7 8 9
#> 
#> [[2]]
#> [1] 10 11
#> 
#> [[3]]
#> [1] 3 4

Note that the issue of uneven weightings is still present--the observations in group 1 are each weighted 1/7, while the observations in groups 2 and 3 are each weighted 1/2.
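
A quick way to see those implicit weights is to look at the cluster sizes:

lengths(clusters)
#> [1] 7 2 2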
