In a pylab program (which could probably be a MATLAB program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t (and the timespan of my data is len(d) time units).
The events I'm interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It's easy to get an array of booleans with b = d<threshold, and the problem comes down to computing the sequence of the lengths of the True-only words in b. But I do not know how to do that efficiently (i.e. using numpy primitives), so I resorted to walking the array and doing manual change detection (i.e. initialize a counter when the value goes from False to True, increase the counter as long as the value is True, and output the counter to the sequence when the value goes back to False). But this is tremendously slow.
How can I efficiently detect that sort of sequence in numpy arrays?
Below is some Python code that illustrates my problem: the fourth dot takes a very long time to appear (if not, increase the size of the array).
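from pylab import *

threshold = 7
print('.')
d = 10*rand(10000000)
print('.')
b = d < threshold
print('.')
durations = []
for i in range(len(b)):
    # a new run starts: the previous value was False (or there is none)
    if b[i] and (i == 0 or not b[i-1]):
        counter = 1
    # the run continues
    if i > 0 and b[i-1] and b[i]:
        counter += 1
    # the run just ended, or the array ends while a run is still open
    if (i > 0 and b[i-1] and not b[i]) or (b[i] and i == len(b)-1):
        durations.append(counter)
print('.')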
Fully numpy vectorized and generic RLE for any array (works with strings, booleans etc too).
Outputs tuple of run lengths, start positions, and values.
Pretty fast (i7):
Multiple data types:
Same results as Alex Martelli above:
Slightly slower than Alex (but still very fast), and much more flexible.
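The code for this answer did not survive on this page. As a rough sketch of what a vectorized, generic run-length encoder could look like (the function name `rle` and its exact return order are assumptions made for illustration, not the answerer's original code):

```python
import numpy as np

def rle(values):
    """Run-length encode a 1-D array.

    Returns (lengths, starts, run_values). Hypothetical helper written
    for illustration; not the original answer's implementation.
    """
    values = np.asarray(values)
    n = len(values)
    if n == 0:
        empty = np.array([], dtype=int)
        return empty, empty, values[:0]
    # True wherever an element differs from its predecessor
    change = values[1:] != values[:-1]
    # index of the first element of every run
    starts = np.concatenate(([0], np.flatnonzero(change) + 1))
    # run lengths are the gaps between consecutive starts (and the array end)
    lengths = np.diff(np.concatenate((starts, [n])))
    return lengths, starts, values[starts]
```

For the question's boolean array b, the durations of the below-threshold events would then be `lengths[run_values]`, i.e. the lengths of the runs whose value is True.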
While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for various solutions including this one, of course):
If you do need the values in a list, you can of course just use list(runs_of_ones(bits)); but maybe a list comprehension might be marginally faster still:
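The snippets for both variants are missing from this page. A minimal sketch, assuming `runs_of_ones` is a generator built on `itertools.groupby` (the list-returning name `runs_of_ones_list` is made up here for illustration):

```python
from itertools import groupby

def runs_of_ones(bits):
    """Yield the length of each run of truthy values in `bits`."""
    for bit, group in groupby(bits):
        if bit:
            # every item in this group is truthy, so counting them
            # gives the length of the run
            yield sum(1 for _ in group)

def runs_of_ones_list(bits):
    """Same result as list(runs_of_ones(bits)), as a list comprehension.

    Hypothetical name, used only for this sketch.
    """
    return [sum(1 for _ in group) for bit, group in groupby(bits) if bit]
```

Both accept any iterable, including the boolean numpy array b from the question, although iterating a large array element by element in Python is exactly the cost that needs measuring.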
Moving to "numpy-native" possibilities, what about:
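The numpy-native snippet is also missing here. One way such an approach might look, sketched under the assumption that runs are located by differencing a zero-padded copy of the array (the name `runs_of_ones_array` is illustrative):

```python
import numpy as np

def runs_of_ones_array(bits):
    """Lengths of the runs of ones in a 1-D 0/1 (or boolean) array.

    Illustrative sketch; the function name is an assumption.
    """
    bits = np.asarray(bits, dtype=np.int8)
    # pad with zeros so that every run has both a start and an end
    bounded = np.concatenate(([0], bits, [0]))
    # +1 where a run starts, -1 just after a run ends
    difs = np.diff(bounded)
    run_starts = np.flatnonzero(difs > 0)
    run_ends = np.flatnonzero(difs < 0)
    return run_ends - run_starts
```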
Again: be sure to benchmark the solutions against each other on examples that are realistic for you!
Here is a solution using only arrays: it takes an array containing a sequence of bools and counts the length of the transitions.
sw contains a True where there is a switch, and isw converts them into indexes. The items of isw are then subtracted pairwise in lens.
Notice that if the sequence started with a 1 it would count the length of the 0s sequences: this can be fixed in the indexing used to compute lens. Also, I have not tested corner cases such as sequences of length 1.
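The session this answer describes is not shown on this page. A small sketch of the three steps, reusing the names sw, isw and lens from the text (the sample array is made up, and it deliberately starts and ends with False so that the switches pair up):

```python
import numpy as np

# made-up sample: two runs of True, of lengths 3 and 4
b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)

# sw[i] is True where b[i] and b[i+1] differ, i.e. where there is a switch
sw = b[:-1] ^ b[1:]
# indexes of the switches
isw = np.flatnonzero(sw)
# pair each False->True switch with the following True->False switch
lens = isw[1::2] - isw[::2]

print(lens)
```

If the array ends in the middle of a run, the number of switches is odd and the pairwise subtraction fails with a shape mismatch, which is one of the corner cases the answer warns about.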
Full function that returns start positions and lengths of all True-subarrays.
Tested for different bool 1D-arrays (empty array; single/multiple elements; even/odd lengths; started with True/False; with only True/False elements).
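The full function referred to above (start positions and lengths of all True-subarrays) is not reproduced on this page. A sketch of how such a function could be written, padding with False so that arrays starting or ending with True need no special casing (the name `true_runs` is hypothetical):

```python
import numpy as np

def true_runs(mask):
    """Return (starts, lengths) of every run of True in a 1-D bool array.

    Hypothetical helper written for illustration.
    """
    mask = np.asarray(mask, dtype=bool)
    # pad with False on both sides so every run is bounded by two edges
    padded = np.concatenate(([False], mask, [False]))
    # positions (in `mask` coordinates) where the value changes
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # edges alternate: run start, exclusive run end, run start, ...
    starts = edges[::2]
    lengths = edges[1::2] - starts
    return starts, lengths
```

Because the padded array always begins and ends with False, the number of edges is even, so the slicing pairs up for empty, all-True and all-False inputs as well.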
Just in case anyone is curious (and since you mentioned MATLAB in passing), here's one way to solve it in MATLAB:
I'm not too familiar with Python, but maybe this could help give you some ideas. =)
Perhaps late to the party, but a Numba-based approach is going to be the fastest by far.
Numpy-based approaches (inspired by @ThomasBrowne's answer but faster, because the use of the expensive numpy.concatenate() is reduced to a minimum) are the runners-up (two similar approaches are presented here, one using inequality to find the positions of the steps, and the other using differences):
These both outperform the naïve and simple single-loop approach:
On the contrary, using itertools.groupby() is not going to be any faster than the simple loop (unless in very special cases like in @AlexMartelli's answer, or unless someone implements __len__ on the group object), because in general there is no simple way of extracting the group size information other than looping through the group itself, which is not exactly fast:
The results of some benchmarks on random integer arrays of varying size are reported:
(Full analysis here).
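None of the benchmarked functions survived on this page. As a sketch only, here is what a single-loop run-length function might look like when written so that Numba can compile it; the function name, the preallocated output buffer and the fallback decorator are all assumptions of this example, and no timings are claimed for it:

```python
import numpy as np

try:
    from numba import njit  # optional: compiles the loop if Numba is installed
except ImportError:
    def njit(func):
        # fallback so the sketch still runs as plain (slow) Python
        return func

@njit
def run_lengths_loop(arr):
    """Lengths of consecutive runs of equal values in a 1-D array."""
    n = len(arr)
    # at most n runs, so preallocate and trim at the end
    lengths = np.empty(n, dtype=np.int64)
    if n == 0:
        return lengths[:0]
    k = 0      # number of completed runs
    count = 1  # length of the run currently being scanned
    for i in range(1, n):
        if arr[i] == arr[i - 1]:
            count += 1
        else:
            lengths[k] = count
            k += 1
            count = 1
    lengths[k] = count  # close the final run
    return lengths[:k + 1]
```

Without the decorator this is just the simple loop; the point made above is that the same loop becomes competitive once it is compiled.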