多边形中的 r 点
我有一百万个点和一个很大的形状文件(8GB),它太大了,无法加载到我系统上的 R 内存中。形状文件是单层的,因此给定的 x
、y
将最多命中一个多边形 - 只要它不完全位于边界上!每个多边形都标有严重性
- 例如1
、2
、3
。我在具有 12GB 内存的 64 位 ubuntu 机器上使用 R。
能够将数据框“标记”到多边形severity
的最简单方法是什么,以便我获得带有额外列的data.frame
,即x ,
y
,严重性
?
I have a million points and a large shape file—8GB—which is too big for to load into memory in R on my system. The shape file is single-layer so a given x
, y
will hit at most one polygon - as long as it's not exactly on a boundary! Each polygon is labelled with a severity
- e.g. 1
, 2
, 3
. I'm using R on a 64-bit ubuntu machine with 12GB ram.
What's the simplest way to be able to "tag" the data frame to the polygon severity
so that I get a data.frame
with an extra column, i.e. x
,y
, severity
?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
仅仅因为你只有一把锤子,并不意味着所有问题都是钉子。
将数据加载到 PostGIS,为多边形构建空间索引,并执行单个 SQL 空间叠加。将结果导出回 R。
顺便说一句,说 shapefile 是 8Gb 并不是一个非常有用的信息。 Shapefile 至少由三个文件组成:.shp(几何图形)、.dbf(数据库)和 .shx(连接这两个文件)。如果您的 .dbf 是 8Gb,那么您可以通过将其替换为不同的 .dbf 来轻松读取形状本身。即使 .shp 是 8Gb,它也可能只是三个多边形,在这种情况下,可能很容易简化它们。你有多少个多边形,shapefile 的 .shp 部分有多大?
Just because all you have is a hammer, doesn't mean every problem is a nail.
Load your data into PostGIS, build a spatial index for your polygons, and do a single SQL spatial overlay. Export results back to R.
By the way, saying the shapefile is 8Gb is not a very useful piece of information. Shapefiles are made from at least three files, the .shp which is the geometry, the .dbf which is the database, and the .shx which connects the two. If your .dbf is 8Gb then you can easily read the shapes themselves in by replacing it with a different .dbf. Even if the .shp is 8Gb it might only be three polygons, in which case it might be easy to simplify them. How many polygons have you got, and how big is the .shp part of the shapefile?
我认为您应该预处理数据并创建一个结构,列出网格中矩形区域的可能多边形。这样,您可以减少必须检查点的多边形,并且附加结构将适合内存,因为它只有指向多边形的指针。
这是一张图片来说明:
你想要检查黄点位于哪个多边形。您通常会检查所有多边形,但通过网格优化(橙色线,没有绘制整个网格,仅绘制其字段之一),您只需检查填充的多边形因为它们都在或部分在网格区域内。
类似的方法是不将所有多边形数据存储在内存中,而只存储多边形边界框,每个多边形只需要 2 个而不是 3 个 X/Y 对(以及指向实际多边形数据的附加指针),但这没有第一个建议节省那么多空间。
I think you should preprocess the data and create a structure that lists possible polygons for rectangular areas in a grid. This way, you can reduce the polygons you'll have to check against the points and the additional structure will fit into memory as it just has pointers to the polygons.
Here's an image to illustrate:
You want to check which polygon the yellow point is in. You would usually check against all the polygons, but with the grid optimization (the orange lines, didn't draw the whole grid, just one of its fields) you only have to checked the filled polygons as they all are inside or partially inside the grid field.
A similar way would be not to store all the polygon data in memory, but just the polygons bounding boxes which would only require 2 instead of 3 X/Y pairs for each polygon (and an additional pointer to the actual polygon data), but this doesn't save as much space as the first suggestion.
我对此很感兴趣,想知道您在这方面是否取得了任何进展。自从您提出问题以来,我想您的计算机硬件和可用于执行此相对简单操作的软件已经有所改进,解决方案(如果仍然需要!)可能非常简单,尽管可能需要很长时间才能完成处理一百万个点。您可能想尝试以下操作:
希望该框架能给您带来您想要的东西。无论你是否可以用你现在可用的计算机/RAM 来做到这一点是另一回事!
I was interested to see this and wondered if you had made any progress on this front. Since you asked the question I imagine your computer hardware and the software you can use to do this relatively simple operation have improved somewhat to the point where the solution (if still needed!) might be quite simple, although it may take a long time to process a million points. You might like to try something like:
Hopefully that framework can give you what you want. Whether or not you can do it with the computer/RAM you have available now is another matter!
我没有一个很好的答案,但让我提出一个想法。您能否翻转问题,而不是问每个点适合哪个多边形,而是问“每个多边形中有哪些点?”也许您可以将 shapefile 分解为例如 2000 个县,然后逐步获取每个县并检查每个点以查看它是否在该县内。如果某个点位于给定的县,那么您可以对其进行标记,并在下次搜索时将其从搜索中删除。
同样,您可以将 shapefile 分成 4 个区域。然后您可以将单个区域加上所有点放入内存中。然后只需迭代数据 4 次即可。
另一个想法是使用 GIS 工具来降低 shapefile 的分辨率(节点和向量的数量)。当然,这取决于准确性对您的用例有多重要。
I don't have a really good answer, but let me throw out an idea. Can you flip the problem around and instead of asking which poly each point fits in, instead as "which points are in each poly?" Maybe you could bust your shapefile into, for example, 2000 counties then incrementally take each county and check each point to see if it is in that county. If a point is in a given county then you tag it, and take it out of your search next time.
Along the same lines, you might break the shapefile into 4 regions. Then you can fit a single region plus all your points into memory. Then just iterate 4 times through the data.
Another idea would be to use a GIS tool to lower the resolution (number of nodes and vectors) of the shapefile. That depends, of course, on how important accuracy is to your use case.
我会尝试 fastshp 包。在我的粗略测试中,它明显优于其他方法阅读 shapefiles。它有专门的内部功能,非常适合您的需求。
代码应该类似于:
其中 x 和 y 是坐标。
如果这不起作用,我会选择 @Spacedman 提到的 PostGIS 解决方案。
I'd give fastshp package a try. In my cursory tests it significantly beats other methods for reading shapefiles. And it has specialized inside function that my well suit your needs.
Code should be somehow similar to:
where x and y are coordinates.
If that doesn't work I'd go for PostGIS solution mentioned by @Spacedman.
回答我自己的问题......并感谢大家的帮助 - 最终采用的解决方案是使用 python 中的 gdal ,它相对容易适应光栅和形状文件。一些栅格文件达到了 40GB 左右,一些形状文件超过了 8GB - 所以它们无法放入我们当时拥有的任何机器的内存中(现在我可以使用具有 128GB 内存的机器) - 但我已经搬到了新的牧场!)。 python/gdal 代码能够在 1 分钟到 40 分钟内标记 2700 万个点,具体取决于 shapefile 中多边形的大小 - 如果有很多小多边形,速度快得惊人 - 如果有一些巨大的(250k 点) )形状文件中的多边形速度慢得惊人!然而相比之下,我们之前在 Oracle 空间数据库上运行它,标记 2700 万个点大约需要 24 小时以上,或者栅格化和标记大约需要一个小时。正如 Spacedman 建议的那样,我确实尝试在带有 SSD 的机器上使用 postgis,但周转时间比使用 python/gdal 慢很多,因为最终的解决方案不需要将 shapefile 加载到 postgis 中。总而言之,最快的方法是使用 Python/gdal:
To answer my own question... and in thanks to everybody for their help - The final solution adopted was to use gdal from python which was relatively easily adapted to both rasters and shape files. The some of rasters ran to about 40gb and some of the shape files exceeded 8gb - so there was no way they'd fit in memory on any of the machines we had at the time (Now I've access to a machine with 128gb ram - but I've moved onto new pastures!). The python/gdal code was able to tag 27million points in from 1 minute to 40 minutes depending on the sizes of the polygons in the shapefiles - if there were a lot of small polygons it was stunningly quick - if there were some massive (250k points) polygons in the shapefiles it was stunningly slow! However to compare, we used to run it previously on an oracle spatial database and it would take about 24 hours+ to tag the 27 million points, or rasterizing and tagging would take about an hour. As Spacedman suggested I did try using postgis on my machine with an ssd, but the turn-around time was quite a bit slower than using python/gdal as the final solution didn't require the shapefiles to be loaded into postgis. So to summarise, the quickest way to do this was using Python/gdal: