Numpy array: fast fill and extract data
See important clarification at bottom of this question.
I am using numpy to speed up some processing of longitude/latitude coordinates. Unfortunately, my numpy "optimizations" made my code run about 5x more slowly than it ran without using numpy.
The bottleneck seems to be in filling the numpy array with my data, and then extracting out that data after I have done the mathematical transformations. To fill the array I basically have a loop like:
    import numpy

    point_list = GetMyPoints() # returns a long list of ( lon, lat ) coordinate pairs
    n = len( point_list )
    point_buffer = numpy.empty( ( n, 2 ), numpy.float32 )
    for point_index in xrange( 0, n ):
        # copies one ( lon, lat ) pair per iteration; this is the slow part
        point_buffer[ point_index ] = point_list[ point_index ]
That loop, just filling in the numpy array before even operating on it, is extremely slow, much slower than the entire computation was without numpy. (That is, it's not just the slowness of the python loop itself, but apparently some huge overhead in actually transferring each small block of data from python to numpy.) There is similar slowness on the other end; after I have processed the numpy arrays, I access each modified coordinate pair in a loop, again as
    some_python_tuple = point_buffer[ index ]
Again that loop to pull the data out is much slower than the entire original computation without numpy. So, how do I actually fill the numpy array and extract data from the numpy array in a way that doesn't defeat the purpose of using numpy in the first place?
I am reading the data from a shape file using a C library that hands me the data as a regular python list. I understand that if the library handed me the coordinates already in a numpy array there would be no "filling" of the numpy array necessary. But unfortunately the starting point for me with the data is as a regular python list. And more to the point, in general I want to understand how you quickly fill a numpy array with data from within python.
Clarification
The loop shown above is actually oversimplified. I wrote it that way in this question because I wanted to focus on the slowness I was seeing when filling a numpy array one element at a time in a loop. I now understand that doing that is just slow.
In my actual application what I have is a shape file of coordinate points, and I have an API to retrieve the points for a given object. There are something like 200,000 objects. So I repeatedly call a function GetShapeCoords( i ) to get the coords for object i. This returns a list of lists, where each sublist is a list of lon/lat pairs, and the reason it's a list of lists is that some of the objects are multi-part (i.e., multi-polygon).
Then, in my original code, as I read in each object's points, I was doing a transformation on each point by calling a regular python function, and then plotting the transformed points using PIL. The whole thing took about 20 seconds to draw all 200,000 polygons. Not terrible, but much room for improvement. I noticed that at least half of those 20 seconds were spent doing the transformation logic, so I thought I'd do that in numpy. And my original implementation was just to read in the objects one at a time, and keep appending all the points from the sublists into one big numpy array, which I then could do the math stuff on in numpy.
So, I now understand that simply passing a whole python list to numpy is the right way to set up a big array. But in my case I only read one object at a time. So one thing I could do is keep appending points together in a big python list of lists of lists. And then when I've compiled some large number of objects' points in this way (say, 10000 objects), I could simply assign that monster list to numpy.
So my question now is three parts:
(a) Is it true that numpy can take that big, irregularly shaped, list of lists of lists, and slurp it okay and quickly?
(b) I then want to be able to transform all the points in the leaves of that monster tree. What is the expression to get numpy to, for instance, "go into each sublist, and then into each subsublist, and then for each coordinate pair you find in those subsublists multiply the first (lon coordinate) by 0.5"? Can I do that?
(c) Finally, I need to get those transformed coordinates back out in order to plot them.
Winston's answer below seems to give some hint at how I might do this all using itertools. What I want to do is pretty much like what Winston does, flattening the list out. But I can't quite just flatten it out. When I go to draw the data, I need to be able to know when one polygon stops and the next starts. So, I think I could make it work if there were a way to quickly mark the end of each polygon (i.e., each subsublist) with a special coordinate pair like (-1000, -1000) or something like that. Then I could flatten with itertools as in Winston's answer, and then do the transforms in numpy. Then I need to actually draw from point to point using PIL, and here I think I'd need to reassign the modified numpy array back to a python list, and then iterate through that list in a regular python loop to do the drawing. Does that seem like my best option short of just writing a C module to handle all the reading and drawing for me in one step?
You describe your data as being "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:
Do this:
itertools and numpy.fromiter are both implemented in c and really efficient. As a result, this should do the transformation very quickly.
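A minimal sketch of that kind of flattening, assuming the data is nested as objects, then polygons, then (lon, lat) pairs. The sample data and the names `objects` and `points` are made up for illustration and are not taken from the answer's own code:

```python
import itertools

import numpy as np

# Hypothetical sample data: objects -> polygons -> (lon, lat) pairs.
objects = [
    [[(0.0, 1.0), (2.0, 3.0)], [(4.0, 5.0)]],
    [[(6.0, 7.0), (8.0, 9.0), (10.0, 11.0)]],
]

# Each chain.from_iterable call removes one level of nesting:
# objects -> polygons -> points -> individual floats.
chain = itertools.chain.from_iterable
flat_values = chain(chain(chain(objects)))

# fromiter builds a 1-D array from the iterator; reshape then views it
# as one row per (lon, lat) pair.
points = np.fromiter(flat_values, dtype=np.float32).reshape(-1, 2)
```

Note that this throws away the polygon boundaries, so it is only enough on its own when the transformation does not care which polygon a point belongs to.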
The second part of your question doesn't really indicate what you want to do with the data. Indexing a numpy array one element at a time is slower than indexing a python list. You get speed by performing operations en masse on the data. Without knowing more about what you are doing with that data, it's hard to suggest how to fix it.
UPDATE:
I've gone ahead and done everything using itertools and numpy. I am not responsible for any brain damage resulting from attempting to understand this code.
You might find it best to deal with a single polygon at a time. Convert each polygon into a numpy array and do the vector operations on that. You'll probably get a significant speed advantage just doing that. Putting all of your data into numpy might be a little difficult.
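A rough sketch of the one-polygon-at-a-time approach, assuming each polygon is a list of (lon, lat) pairs and the transformation is the "multiply lon by 0.5" example from the question. The function name `transform_polygon` and the sample data are hypothetical:

```python
import numpy as np


def transform_polygon(polygon, lon_scale=0.5):
    """Scale the lon of every point and return a list of (lon, lat) tuples."""
    pts = np.array(polygon, dtype=np.float64)  # shape (n_points, 2)
    pts[:, 0] *= lon_scale                     # one vectorized operation
    # Convert back in a single call; a list of (x, y) tuples is a form
    # that PIL's ImageDraw.polygon accepts.
    return [tuple(p) for p in pts.tolist()]


# Hypothetical multi-part object: two polygons of three points each.
shape = [
    [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0)],
    [(40.0, 4.0), (50.0, 5.0), (60.0, 6.0)],
]
transformed = [transform_polygon(polygon) for polygon in shape]
```

The python-level loop now runs once per polygon rather than once per point, which is where most of the saving would come from.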
This is more difficult than most numpy stuff because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.
The point of using numpy arrays is to avoid for loops as much as possible. Writing for loops yourself will result in slow code, but with numpy arrays you can use predefined vectorized functions, which are much faster (and easier!).
So for the conversion of a list to an array you can use:
If the list contains elements like (lat, lon), then this will be converted to an array with two columns. With that numpy array you can easily manipulate all elements at once. For example, to multiply the first element of each coordinate pair by 0.5 as in your question, you can simply do the following (assuming that the first elements are in the first column):
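As a minimal sketch, assuming `point_list` is a flat list of coordinate pairs (the sample values are made up), the conversion and the column-wise multiplication could look like this:

```python
import numpy as np

# Hypothetical sample data: a flat list of coordinate pairs.
point_list = [(10.0, 50.0), (20.0, 60.0), (30.0, 70.0)]

# One call converts the whole list into a (3, 2) array.
point_buffer = np.array(point_list, dtype=np.float32)

# Multiply the first column of every row by 0.5, in place.
point_buffer[:, 0] *= 0.5
```

Writing `point_buffer[:, 0] * 0.5` without the assignment would return a new array and leave `point_buffer` unchanged.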
This will be faster:
Modify the array, not the list. It would obviously be better to avoid creating the list in the first place if possible.
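A sketch of the full round trip this implies, assuming the points start out as a python list and have to end up as python objects again for drawing. The names `scale` and `points_out` and the sample values are illustrative only:

```python
import numpy as np

# Hypothetical input: a plain python list of (lon, lat) pairs.
point_list = [(10.0, 50.0), (20.0, 60.0)]

# Into numpy in one call rather than row by row.
point_buffer = np.array(point_list, dtype=np.float32)

# Modify the array itself; the scale broadcasts across every row.
scale = np.array([0.5, 1.0], dtype=np.float32)
point_buffer *= scale

# Back out in one call: tolist() returns nested lists of python floats,
# avoiding a slow point_buffer[index] lookup for every point.
points_out = [tuple(p) for p in point_buffer.tolist()]
```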
Edit 1: profiling
Here is some test code that demonstrates just how efficiently numpy converts lists to arrays (it's good). It also shows that my list-to-buffer idea is only comparable to what numpy does, not better.
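A rough sketch of such a comparison, assuming synthetic data and using the standard `timeit` module. It covers only the loop fill from the question and the single-call conversion, not the list-to-buffer variant, and the function names are hypothetical. Timings depend on the machine, so none are quoted:

```python
import timeit

import numpy as np

# Synthetic stand-in for the real coordinate list.
point_list = [(float(i), float(i) + 0.5) for i in range(20000)]


def fill_in_loop():
    """Fill a preallocated array one row at a time, as in the question."""
    buf = np.empty((len(point_list), 2), np.float32)
    for i, point in enumerate(point_list):
        buf[i] = point
    return buf


def fill_in_one_call():
    """Let numpy convert the whole list at once."""
    return np.array(point_list, dtype=np.float32)


loop_seconds = timeit.timeit(fill_in_loop, number=3)
call_seconds = timeit.timeit(fill_in_one_call, number=3)
print("loop: %.3f s, single call: %.3f s" % (loop_seconds, call_seconds))
```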
results:
Edit 2: Regarding the hierarchical nature of the data
If I understand correctly that the data is always a list of lists of lists (object - polygon - coordinate), then this is the approach I'd take: reduce the data to the lowest dimension that creates a rectangular array (2D in this case) and track the indices of the higher-level branches with a separate array. This is essentially an implementation of Winston's idea of using numpy.fromiter on an itertools chain object. The only added idea is the branch indexing.
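A minimal sketch of the branch-indexing idea, assuming the object - polygon - coordinate nesting described above. The sample data and the names `polygon_starts` and `object_of_polygon` are hypothetical, and it uses numpy.array on a flattened list rather than numpy.fromiter for brevity:

```python
import itertools

import numpy as np

# Hypothetical sample data: objects -> polygons -> (lon, lat) pairs.
objects = [
    [[(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)],
     [(6.0, 7.0), (8.0, 9.0), (10.0, 11.0)]],
    [[(12.0, 13.0), (14.0, 15.0), (16.0, 17.0), (18.0, 19.0)]],
]

# Flatten to a list of polygons, then to one (n_points, 2) array.
polygons = list(itertools.chain.from_iterable(objects))
points = np.array(list(itertools.chain.from_iterable(polygons)),
                  dtype=np.float64)

# Branch indices: the row where each polygon starts, and the object
# that each polygon belongs to.
lengths = [len(polygon) for polygon in polygons]
polygon_starts = np.cumsum([0] + lengths[:-1])
object_of_polygon = np.repeat(np.arange(len(objects)),
                              [len(obj) for obj in objects])

# Transform every point at once.
points[:, 0] *= 0.5

# Recover one list of (lon, lat) tuples per polygon for drawing.
outlines = [
    [tuple(p) for p in part.tolist()]
    for part in np.split(points, polygon_starts[1:])
]
```

Because the boundaries live in `polygon_starts`, no sentinel pair such as (-1000, -1000) has to be mixed into the coordinate data.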