How can I use geographic proximity to fill in missing categorical values in R?

Posted 2024-12-14 17:50:30


I have some data that looks like this:

ID      lat      long     university   date        cat2    cat3   cat4   ...
00001   32.001   -64.001  MIT          2011-07-01  xyz     foo    NA     ...
00002   45.783   67.672   Harvard      2011-07-01  abc     NA     lion   ...
00003   54.823   78.762   Stanford     2011-07-01  xyz     bar    NA     ...
00004   76.782   23.989   IIT Bombay   2011-07-02  NA      foo    NA     ...
00005   32.010   -64.010  NA           2011-07-02  NA      NA     hamster...
00006   32.020   -64.020  NA           2011-07-03  NA      NA     NA     ...
00006   45.793   67.700   NA           2011-08-01  NA      bar    badger ...

I want to impute missing values for the university column based on the lat-long coordinates. The sample above is obviously made up; the real data has 500K rows and is rather sparse in the university column. Imputation packages like Amelia seem to want to fit numerical data according to a linear model, and zoo seems to want to fill in missing values based on some sort of ordered series, which I don't have. I want to match close lat-longs, not just exact lat-long pairs, so I can't simply fill in one column by matching values from another.

I plan to approach the problem by finding all the lat-long pairs associated with a university and drawing a bounding box around them. Then, for every row that has a lat-long pair but is missing university data, I would fill in the university whose box the point falls in, or perhaps the university whose known locations have a midpoint within a certain radius of the point.
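The radius variant of this plan can be sketched in base R. This is only an illustration under stated assumptions: the helper names `haversineKm` and `imputeByRadius`, the toy rows, and the 10 km cutoff are all made up for the example, and averaging latitudes and longitudes to get a midpoint is only reasonable when a university's points are close together and do not straddle the antimeridian.

```r
## Great-circle (haversine) distance in km between points given in degrees.
## Assumes a spherical Earth with a mean radius of 6371 km.
haversineKm <- function(lat1, lon1, lat2, lon2, radiusKm = 6371) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radiusKm * asin(pmin(1, sqrt(a)))
}

## Toy data modelled on the sample rows (the last row is far from everything).
df <- data.frame(
  lat  = c(32.001, 45.783, 54.823, 32.010, 32.020, 45.793, 10.000),
  long = c(-64.001, 67.672, 78.762, -64.010, -64.020, 67.700, 10.000),
  university = c("MIT", "Harvard", "Stanford", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

## One midpoint (mean lat/long) per known university.
known <- df[!is.na(df$university), ]
centroids <- aggregate(cbind(lat, long) ~ university, data = known, FUN = mean)

## Fill a missing university only if the nearest midpoint is within maxKm.
imputeByRadius <- function(df, centroids, maxKm = 10) {
  for (i in which(is.na(df$university))) {
    d <- haversineKm(df$lat[i], df$long[i], centroids$lat, centroids$long)
    j <- which.min(d)
    if (d[j] <= maxKm) df$university[i] <- as.character(centroids$university[j])
  }
  df
}

filled <- imputeByRadius(df, centroids, maxKm = 10)
```

Rows that are not within the cutoff of any midpoint stay `NA`, which is probably preferable to guessing. With 500K rows the loop only visits rows with a missing university, and each iteration is vectorized over the (much smaller) set of midpoints.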

Has anyone ever done something similar? Are there any packages that make it easier to group geographically proximate lat-long pairs or maybe even to do geographically-based imputation?

If that works, I'd like to take a crack at imputing some of the other missing values based on existing values in the data (for example, if 90% of rows with xyz, foo, and Harvard also have lion in the 4th category, we can impute some missing values for cat4). That's another question, though, and I would imagine a much harder one, which I might not even have enough data to do successfully.
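One simple way to act on the "90% of rows" idea is to fill a missing value with the most common value among rows that share the same key columns, but only when that value is dominant enough. The sketch below is a rough illustration, not an established imputation method: the function name `imputeByMode`, the thresholds `minShare` and `minCount`, and the toy data are all assumptions made for the example.

```r
## Hypothetical toy data: cat4 is missing in rows 11 and 14.
df <- data.frame(
  university = c(rep("Harvard", 11), "MIT", "MIT", "MIT"),
  cat2 = c(rep("xyz", 11), "abc", "abc", "abc"),
  cat3 = c(rep("foo", 11), "bar", "bar", "bar"),
  cat4 = c(rep("lion", 9), "badger", NA, "hamster", "lion", NA),
  stringsAsFactors = FALSE
)

## Within each combination of the key columns, fill missing values of the
## target column with its most common observed value, provided that value
## accounts for at least minShare of at least minCount observed rows.
imputeByMode <- function(df, target, keys, minShare = 0.9, minCount = 5) {
  group <- interaction(df[keys], drop = TRUE)
  for (g in levels(group)) {
    inGroup <- which(group == g)  # rows with an NA key belong to no group
    observed <- df[[target]][inGroup]
    observed <- observed[!is.na(observed)]
    if (length(observed) < minCount) next
    counts <- table(observed)
    if (max(counts) / length(observed) >= minShare) {
      fill <- inGroup[is.na(df[[target]][inGroup])]
      df[[target]][fill] <- names(counts)[which.max(counts)]
    }
  }
  df
}

result <- imputeByMode(df, target = "cat4",
                       keys = c("university", "cat2", "cat3"))
```

In the toy data the Harvard/xyz/foo group has 9 of 10 observed rows equal to lion, so its missing cat4 is filled, while the MIT group has too few observed rows and is left alone. With sparse real data, many groups would likely fall below `minCount`, which matches the concern about not having enough data.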


乖乖公主 2024-12-21 17:50:41

I don't have a package in mind to solve what you're describing. I've done some similar types of analysis, and I ended up writing something bespoke.

Just to give you a jumping-off point, here's an example of one way of doing a nearest neighbor calculation. Calculating neighbors this way is kind of slow because it is brute force: every point has to be compared against every other point.

## make some pretend data
n <- 1e4
lat <- rnorm(n)
lon <- rnorm(n)
index <- 1:n
myDf <- data.frame(lat, lon, index)

## create a few helper functions
## Euclidean distance from (x1, y1) to each point in (x2, y2)
cartDist <- function(x1, y1, x2, y2){
  ( (x2 - x1)^2 + (y2 - y1)^2 )^.5
}

## positions (within x2/y2) and distances of the n closest points
nearestNeighbors <- function(x1, y1, x2, y2, n=1){
  dists <- cartDist(x1, y1, x2, y2)
  index <- order(dists)[seq_len(n)]
  neighborValues <- dists[index]
  return(list(index, neighborValues))
}


## this could be done in an apply statement
## but it's fugly enough as a loop
myDf$nearestNeighbor <- NA_integer_
system.time({
for (i in 1:nrow(myDf)){
  others <- myDf[-i,]  # every point except the current one
  myDf$nearestNeighbor[i] <- others$index[nearestNeighbors( myDf$lon[i], myDf$lat[i], others$lon, others$lat )[[1]]]
}
})
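Applied to the question's data, the same nearest neighbor idea gets cheaper if only the rows missing a university are searched, and only against rows that already have one. The following is a minimal sketch under assumptions: the `labeled` and `unlabeled` data frames and the helper `nearestLabeled` are hypothetical, and the distance is a flat-earth approximation (longitude differences scaled by the cosine of the latitude) that is only sensible for points that are fairly close together.

```r
## Hypothetical labeled and unlabeled rows modelled on the question's sample.
labeled <- data.frame(
  lat = c(32.001, 45.783, 54.823),
  lon = c(-64.001, 67.672, 78.762),
  university = c("MIT", "Harvard", "Stanford"),
  stringsAsFactors = FALSE
)
unlabeled <- data.frame(
  lat = c(32.010, 32.020, 45.793),
  lon = c(-64.010, -64.020, 67.700)
)

## For one query point, return the row number of the closest labeled point.
## Longitude differences are scaled by cos(latitude) so that a degree of
## longitude counts for less away from the equator.
nearestLabeled <- function(lat, lon, labeled) {
  dLat <- labeled$lat - lat
  dLon <- (labeled$lon - lon) * cos(lat * pi / 180)
  which.min(dLat^2 + dLon^2)
}

## One vectorized distance computation per unlabeled row.
nearestRow <- mapply(nearestLabeled, unlabeled$lat, unlabeled$lon,
                     MoreArgs = list(labeled = labeled))
unlabeled$university <- labeled$university[nearestRow]
```

This always assigns the nearest label, however far away it is, so in practice a distance cutoff would probably be needed to avoid labeling points that are not near any known university.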
