How can I use R to fill in missing categorical values based on geographic proximity?
I have some data that looks like this:
ID     lat     long     university  date        cat2  cat3  cat4     ...
00001  32.001  -64.001  MIT         2011-07-01  xyz   foo   NA       ...
00002  45.783   67.672  Harvard     2011-07-01  abc   NA    lion     ...
00003  54.823   78.762  Stanford    2011-07-01  xyz   bar   NA       ...
00004  76.782   23.989  IIT Bombay  2011-07-02  NA    foo   NA       ...
00005  32.010  -64.010  NA          2011-07-02  NA    NA    hamster  ...
00006  32.020  -64.020  NA          2011-07-03  NA    NA    NA       ...
00006  45.793   67.700  NA          2011-08-01  NA    bar   badger   ...
I want to impute missing values for the university column based on the lat-long coordinates. The sample above is obviously made up; the real data has about 500K rows and is rather sparse in the university column. Imputation packages like Amelia seem to want to fit numerical data according to a linear model, and zoo seems to want to fill in missing values based on some sort of ordered series, which I don't have. I want to match nearby lat-longs, not just exact lat-long pairs, so I can't simply fill in one column by matching values from another.
I plan to approach the problem by finding all the lat-long pairs associated with a university and drawing a bounding box around them. Then, for every row that has a lat-long pair but is missing university data, I would add the appropriate university depending on which box the point falls in, or perhaps on whether it lies within a certain radius of the midpoint of the known locations.
Has anyone ever done something similar? Are there any packages that make it easier to group geographically proximate lat-long pairs, or maybe even to do geographically based imputation?
If that works, I'd like to take a crack at imputing some of the other missing values based on existing values in the data (for example, if 90% of rows with xyz, foo, Harvard values also have lion in cat4, we can impute some missing values for cat4). That's another question, though, and I would imagine a much harder one, which I might not even have enough data to do successfully.
Comments (1)
I don't have a package in mind to solve what you're describing. I've done a similar type of analysis and ended up writing something bespoke.
Just to give you a jumping-off point, here's an example of one way of doing a nearest neighbor calculation. Calculating neighbors is kind of slow because, with a naive approach, you have to calculate the distance from every point to every other point.
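As a minimal sketch of what such a nearest neighbor fill might look like in base R (not necessarily the code this answer had in mind), the snippet below assumes a data frame with `lat`, `long` and `university` columns as in the question. The helper names `haversine_km` and `impute_nearest`, and the `max_km` cutoff, are hypothetical and only for illustration. Each row with a missing university takes the label of the closest labelled row, but only if that row is within the cutoff.

```r
# Great-circle (haversine) distance in km between one point and a vector of
# points. Inputs are decimal degrees; 6371 km is the mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Fill missing `university` values with the label of the nearest labelled row,
# but only when that neighbor is within `max_km` (an assumed, tunable cutoff).
impute_nearest <- function(df, max_km = 10) {
  has_xy  <- !is.na(df$lat) & !is.na(df$long)
  known   <- df[!is.na(df$university) & has_xy, ]
  missing <- which(is.na(df$university) & has_xy)
  if (nrow(known) == 0) return(df)
  for (i in missing) {
    # Distance from this unlabelled row to every labelled row.
    d <- haversine_km(df$lat[i], df$long[i], known$lat, known$long)
    j <- which.min(d)
    if (d[j] <= max_km) df$university[i] <- known$university[j]
  }
  df
}

# Toy data modelled on the sample rows in the question.
df <- data.frame(
  lat  = c(32.001, 45.783, 54.823, 32.010, 32.020, 45.793),
  long = c(-64.001, 67.672, 78.762, -64.010, -64.020, 67.700),
  university = c("MIT", "Harvard", "Stanford", NA, NA, NA),
  stringsAsFactors = FALSE
)

filled <- impute_nearest(df, max_km = 10)
print(filled)
```

This loop compares each unlabelled row against every labelled row, so the cost grows with the product of the two counts. That may be tolerable when the labelled rows are sparse, but for 500K rows a spatial index or a dedicated nearest neighbor package would probably be worth investigating.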