MATLAB: compute the mean of each 1-minute interval of a time series
I have a bunch of time series, each described by two components: a timestamp vector (in seconds) and a vector of measured values. The time vector is non-uniform (i.e. sampled at irregular intervals).
I am trying to compute the mean/SD of the values in each 1-minute interval (take an X-minute interval, compute its mean, take the next interval, ...).
My current implementation uses loops. This is a sample of what I have so far:
t = (100:999)' + rand(900,1); %' non-uniform time
x = 5*rand(900,1) + 10; % x(i) is the value at time t(i)
interval = 1; % 1-min interval
tt = ( floor(t(1)):interval*60:ceil(t(end)) )'; %' stopping points of each interval
N = length(tt)-1;
mu = zeros(N,1);
sd = zeros(N,1);
for i=1:N
indices = ( tt(i) <= t & t < tt(i+1) ); % find t between tt(i) and tt(i+1)
mu(i) = mean( x(indices) );
sd(i) = std( x(indices) );
end
I am wondering if there is a faster vectorized solution. This is important because I have a large number of time series to process, each much longer than the sample shown above.
Any help is welcome.
Thank you all for the feedback.
I corrected the way t is generated so that it is always monotonically increasing (sorted); this was not really an issue.
Also, I may not have stated this clearly, but my intention is to have a solution for any interval length in minutes (1 minute was just an example).
6 Answers
The only logical solution seems to be...
Ok. I find it funny that to me there is only one logical solution, but many others find other solutions. Regardless, the solution does seem simple. Given the vectors x and t, and a set of equally spaced break points tt,
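As a sketch of that setup, consistent with the question's data and the sorting remark below:

t  = sort( (100:999)' + rand(900,1) );              % non-uniform time, explicitly sorted
x  = 5*rand(900,1) + 10;                            % x(i) is the value at time t(i)
interval = 1;                                       % interval length in minutes
tt = ( floor(t(1)) : interval*60 : ceil(t(end)) )'; % equally spaced break points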
(Note that I sorted t above.)
I would do this in three fully vectorized lines of code. First, if the breaks were arbitrary and potentially unequal in spacing, I would use histc to determine which intervals the data series falls in. Given they are uniform, just do this:
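A minimal sketch of that step, reusing the question's variables (not necessarily the poster's exact line):

% interval index of each sample; valid because the break points tt are uniformly
% spaced interval*60 seconds apart, and t is sorted so t(1) is its minimum
bin = 1 + floor( (t - t(1)) / (interval*60) );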
Again, if the elements of t were not known to be sorted, I would have used min(t) instead of t(1). Having done that, use accumarray to reduce the results into a mean and standard deviation.
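Again as a sketch rather than the poster's verbatim code, the accumarray reduction would be:

mu = accumarray(bin, x, [], @mean);   % one mean per interval
sd = accumarray(bin, x, [], @std);    % one standard deviation per interval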
You could try and create a cell array and apply mean and std via cellfun. It's ~10% slower than your solution for 900 entries, but ~10x faster for 90000 entries.
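A sketch of that approach, assuming uniformly spaced minute bins with at least one sample per interval (the poster's exact binning differed slightly, per the note below):

% group the samples into per-interval cells, then reduce with cellfun
bin   = floor( t / (interval*60) );          % absolute minute index of each sample
bin   = bin - min(bin) + 1;                  % shift so indices start at 1
xCell = accumarray(bin, x, [], @(v) {v});    % collect each interval's values in a cell
mu    = cellfun(@mean, xCell);
sd    = cellfun(@std,  xCell);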
Note: my solution does not give the exact same results as yours, since you skip a few time values at the end (1:60:90 is [1,61]), and since the start of the interval is not exactly the same.
Here's a way that uses binary search. It is 6-10x faster for 9900 elements and about 64x faster for 99900 elements. It was hard to get reliable times using only 900 elements, so I'm not sure which is faster at that size. It uses almost no extra memory if you consider making tx directly from the generated data. Other than that, it just has four extra float variables (prevind, first, mid, and last).
It uses all of the variables that you had originally. I hope that it suits your needs. It is faster because it takes O(log N) to find the indices with binary search, but O(N) to find them the way you were doing it.
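A sketch of the idea, reusing the question's variables (with t sorted) and the prevind/first/mid/last bookkeeping mentioned above:

N  = length(tt) - 1;
mu = zeros(N,1);
sd = zeros(N,1);
prevind = 1;                           % first sample not yet assigned to an interval
for i = 1:N
    % binary search for the first index with t >= tt(i+1)
    first = prevind;
    last  = length(t) + 1;
    while first < last
        mid = floor( (first + last)/2 );
        if t(mid) < tt(i+1)
            first = mid + 1;
        else
            last = mid;
        end
    end
    idx   = prevind:first-1;           % contiguous block with tt(i) <= t < tt(i+1)
    mu(i) = mean( x(idx) );
    sd(i) = std( x(idx) );
    prevind = first;
end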
You can compute indices all at once using bsxfun. This is faster than looping, but requires storing them all at once (a time vs. space tradeoff).
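A sketch of that idea; the reduction via matrix products is an illustrative choice, not necessarily the poster's:

% 900-by-N indicator matrix: entry (i,j) is 1 when tt(j) <= t(i) < tt(j+1)
ind = double( bsxfun(@ge, t, tt(1:end-1)') & bsxfun(@lt, t, tt(2:end)') );
n   = sum(ind, 1)';                          % number of samples in each interval
sx  = ind' * x;                              % per-interval sums
sx2 = ind' * x.^2;                           % per-interval sums of squares
mu  = sx ./ n;
sd  = sqrt( (sx2 - n.*mu.^2) ./ (n - 1) );   % sample standard deviation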
Disclaimer: I worked this out on paper, but haven't yet had the opportunity to check it "in silico"...
You may be able to avoid loops or using cell arrays by doing some tricky cumulative sums, indexing, and calculating the means and standard deviations yourself. Here's some code that I believe will work, although I am unsure how it stacks up speed-wise to the other solutions:
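A sketch of the cumulative-sum idea, assuming t is sorted and every interval contains at least one sample:

n     = histc(t, tt);            % counts per edge; the last entry counts t == tt(end)
n     = n(1:end-1);              % keep one count per interval
csx   = cumsum(x);               % running sum of the values
csx2  = cumsum(x.^2);            % running sum of the squared values
last  = cumsum(n);               % index of the last sample in each interval
sumx  = diff([0; csx(last)]);    % per-interval sums
sumx2 = diff([0; csx2(last)]);   % per-interval sums of squares
mu    = sumx ./ n;
sd    = sqrt( (sumx2 - n.*mu.^2) ./ (n - 1) );   % simplified SD formula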
The above computes the standard deviation using the simplification of the formula found on this Wikipedia page.
The same answer as above, but with a parametric interval (window_size). The issue with the vector lengths is solved as well.
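A sketch building on the cumulative-sum version above; the value of window_size and the edge extension are illustrative, and t is again assumed sorted with no empty intervals:

window_size = 5;                                        % interval length in minutes (example value)
edges = ( floor(t(1)) : window_size*60 : ceil(t(end)) )';
if edges(end) < t(end)                                  % extend so the last partial interval is covered
    edges(end+1) = edges(end) + window_size*60;
end
n     = histc(t, edges);
n     = n(1:end-1);                                     % one count per interval
csx   = cumsum(x);
csx2  = cumsum(x.^2);
last  = cumsum(n);                                      % last sample index in each interval
sumx  = diff([0; csx(last)]);
sumx2 = diff([0; csx2(last)]);
mu    = sumx ./ n;
sd    = sqrt( (sumx2 - n.*mu.^2) ./ (n - 1) );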