Why do NumPy correlate and corrcoef return different values, and how can I normalize a correlation in "full" mode?
I'm trying to do some time series analysis in Python using NumPy.
I have two medium-sized series, with 20k values each, and I want to check the sliding correlation.
corrcoef gives me a matrix of auto-correlation/correlation coefficients as output. That is not useful by itself in my case, because one of the series contains a lag.
The correlate function (in mode="full") returns a 40k-element array that does look like the kind of result I'm aiming for (the peak value is as far from the center of the array as the lag would indicate), but the values are all weird: up to 500, when I was expecting something between -1 and 1.
I can't just divide everything by the max value; I know the max correlation isn't 1.
How can I normalize the "cross-correlation" (correlation in "full" mode) so that the returned values are the correlation at each lag step instead of those very large, strange values?
You are looking for normalized cross-correlation. This option isn't available in NumPy yet, but a patch that does just what you want is waiting for review. It shouldn't be too hard to apply. Most of the patch is docstring changes; the only lines of code that it adds are
where a and v are the input NumPy arrays whose cross-correlation you want. It shouldn't be hard either to add those lines to your own copy of NumPy or to make a copy of the correlate function and add them there. Personally, I would do the latter if I went this route.
Another, quite possibly better, alternative is to normalize the input vectors yourself before passing them to correlate. It's up to you which way you prefer.
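As a rough illustration of that second alternative (not the patch itself), the sketch below standardizes both inputs before calling np.correlate. The function name normalized_xcorr is made up for this example, and dividing by len(a) with the population standard deviation is an assumption chosen to match the len(a) normalization discussed in this answer. It also assumes both series have the same length, as in the question.

```python
import numpy as np


def normalized_xcorr(a, v):
    """Cross-correlate two equal-length 1-D series, scaled to [-1, 1].

    Hypothetical helper for illustration only; it is not part of NumPy.
    """
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    if a.ndim != 1 or a.shape != v.shape:
        raise ValueError("a and v must be 1-D arrays of the same length")

    # Center both inputs and scale by the population standard deviation
    # (ddof=0). The extra len(a) factor turns the sum of products that
    # np.correlate computes into an average.
    a_norm = (a - a.mean()) / (a.std() * len(a))
    v_norm = (v - v.mean()) / v.std()

    ncc = np.correlate(a_norm, v_norm, mode="full")

    # In "full" mode, index i corresponds to lag k = i - (len(v) - 1),
    # where lag k pairs a[n + k] with v[n].
    lags = np.arange(-(len(v) - 1), len(a))
    return lags, ncc
```

With this scaling, the zero-lag entry equals np.corrcoef(a, v)[0, 1], and every entry stays within [-1, 1]. Values at large lags shrink toward zero, because fewer samples overlap while the divisor stays len(a). The lag of the peak can be read off with lags[np.argmax(ncc)].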
By the way, the patch's normalization does appear to be correct per the Wikipedia page on cross-correlation, except that it divides by len(a) rather than (len(a)-1). I feel that the discrepancy is akin to the population standard deviation (dividing by N) versus the sample standard deviation (dividing by N-1), and in my opinion it won't make much of a difference.
According to these slides, I would suggest doing it this way:
For full mode, would it make sense to compute corrcoef directly on the lagged signal/feature? Code example:
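The commenter's own code is not shown here, so the following is only a minimal sketch of the idea under stated assumptions: for each lag, take the overlapping parts of the two series and pass them to np.corrcoef. The function name lagged_corrcoef and the max_lag parameter are hypothetical, and the lag convention (lag k pairs a[n + k] with v[n]) is chosen to match np.correlate.

```python
import numpy as np


def lagged_corrcoef(a, v, max_lag):
    """Pearson correlation of the overlapping segments at each lag.

    Hypothetical helper for illustration only; it is not part of NumPy.
    """
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    if a.ndim != 1 or a.shape != v.shape:
        raise ValueError("a and v must be 1-D arrays of the same length")
    n = len(a)
    if not 0 <= max_lag <= n - 2:
        # Each overlapping segment needs at least two samples.
        raise ValueError("max_lag must be between 0 and len(a) - 2")

    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(lags.size)
    for i, k in enumerate(lags):
        if k >= 0:
            a_seg, v_seg = a[k:], v[:n - k]   # a[n + k] against v[n]
        else:
            a_seg, v_seg = a[:n + k], v[-k:]  # same pairing for negative k
        # Mean and standard deviation are recomputed on every segment.
        r[i] = np.corrcoef(a_seg, v_seg)[0, 1]
    return lags, r
```

Unlike scaling the output of np.correlate once, this renormalizes each overlapping segment separately, so a perfectly shifted copy scores 1 at its true lag. The trade-off is a Python loop with one corrcoef call per lag, which is why the sketch limits the range with max_lag rather than scanning all lags of a 20k-sample series.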