为什么 NUMPY correlate 和 corrcoef 返回不同的值以及如何“标准化” “完整”的相关性模式？

发布于 2024-11-01 08:18:27 字数 369 浏览 12 评论 0原文

我正在尝试使用 Numpy 在 Python 中使用一些时间序列分析。

我有两个中等规模的系列，每个系列有 20k 个值，我想检查滑动相关性。

corrcoef 为我提供了自相关/相关系数矩阵作为输出。就我而言，它本身没有任何用处，因为其中一个系列包含滞后。

关联函数（在 mode="full" 中）返回一个 40k 元素列表，该列表确实看起来像我想要的结果类型（峰值与滞后指示的列表中心一样远），但是这些值都很奇怪 - 高达 500，而我期望的是 -1 到 1 之间的值。

我不能将其全部除以最大值；我知道最大相关性不是 1。

如何标准化“互相关”（“完整”模式下的相关），以便返回值将是每个滞后步骤的相关性，而不是那些非常大、奇怪的值？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

锦爱 2024-11-08 08:18:27

您正在寻找标准化互相关。此选项在 Numpy 中尚不可用，但补丁正在等待审核，该补丁可以正是您想要的。我认为应用它应该不会太难。大部分补丁只是文档字符串内容。它添加的唯一代码行是

if normalize:
    a = (a - mean(a)) / (std(a) * len(a))
    v = (v - mean(v)) /  std(v)

其中 a 和 v 是您要查找其互相关的输入 numpy 数组。将它们添加到您自己的 Numpy 发行版中，或者只是复制相关函数并在其中添加行应该不难。如果我选择走这条路，我个人会选择后者。

另一种可能更好的替代方法是在将输入向量发送到进行关联之前对输入向量进行归一化。这取决于您想采用哪种方式。

You are looking for normalized cross-correlation. This option isn't available yet in Numpy, but a patch is waiting for review that does just what you want. It shouldn't be too hard to apply it I would think. Most of the patch is just doc string stuff. The only lines of code that it adds are

if normalize:
    a = (a - mean(a)) / (std(a) * len(a))
    v = (v - mean(v)) /  std(v)

where a and v are the inputted numpy arrays of which you are finding the cross-correlation. It shouldn't be hard to either add them into your own distribution of Numpy or just make a copy of the correlate function and add the lines there. I would do the latter personally if I chose to go this route.

Another, quite possibly better, alternative is to just do the normalization to the input vectors before you send it to correlate. It's up to you which way you would like to do it.

By the way, this does appear to be the correct normalization as per the Wikipedia page on cross-correlation except for dividing by len(a) rather than (len(a)-1). I feel that the discrepancy is akin to the standard deviation of the sample vs. sample standard deviation and really won't make much of a difference in my opinion.

回复收藏 0 原文

十二 2024-11-08 08:18:27

根据这张幻灯片，我建议这样做：

def cross_correlation(a1, a2):
        lags = range(-len(a1)+1, len(a2))
        cs = []
        for lag in lags:
            idx_lower_a1 = max(lag, 0)
            idx_lower_a2 = max(-lag, 0)
            idx_upper_a1 = min(len(a1), len(a1)+lag)
            idx_upper_a2 = min(len(a2), len(a2)-lag)
            b1 = a1[idx_lower_a1:idx_upper_a1]
            b2 = a2[idx_lower_a2:idx_upper_a2]
            c = np.correlate(b1, b2)[0]
            c = c / np.sqrt((b1**2).sum() * (b2**2).sum())
            cs.append(c)
        return cs

According to this slides, I would suggest to do it this way:

def cross_correlation(a1, a2):
        lags = range(-len(a1)+1, len(a2))
        cs = []
        for lag in lags:
            idx_lower_a1 = max(lag, 0)
            idx_lower_a2 = max(-lag, 0)
            idx_upper_a1 = min(len(a1), len(a1)+lag)
            idx_upper_a2 = min(len(a2), len(a2)-lag)
            b1 = a1[idx_lower_a1:idx_upper_a1]
            b2 = a2[idx_lower_a2:idx_upper_a2]
            c = np.correlate(b1, b2)[0]
            c = c / np.sqrt((b1**2).sum() * (b2**2).sum())
            cs.append(c)
        return cs

回复收藏 0 原文

靖瑶 2024-11-08 08:18:27

对于完整模式，直接在滞后信号/特征上计算corrcoef是否有意义？代码

from dataclasses import dataclass
from typing import Any, Optional, Sequence

import numpy as np

ArrayLike = Any


@dataclass
class XCorr:
    cross_correlation: np.ndarray
    lags: np.ndarray


def cross_correlation(
    signal: ArrayLike, feature: ArrayLike, lags: Optional[Sequence[int]] = None
) -> XCorr:
    """
    Computes normalized cross correlation between the `signal` and the `feature`.
    Current implementation assumes the `feature` can't be longer than the `signal`.
    You can optionally provide specific lags, if not provided `signal` is padded
    with the length of the `feature` - 1, and the `feature` is slid/padded (creating lags)
    with 0 padding to match the length of the new signal. Pearson product-moment
    correlation coefficients is computed for each lag.

    See: https://en.wikipedia.org/wiki/Cross-correlation

    :param signal: observed signal
    :param feature: feature you are looking for
    :param lags: optional lags, if not provided equals to (-len(feature), len(signal))
    """
    signal_ar = np.asarray(signal)
    feature_ar = np.asarray(feature)
    if np.count_nonzero(feature_ar) == 0:
        raise ValueError("Unsupported - feature contains only zeros")
    assert (
        signal_ar.ndim == feature_ar.ndim == 1
    ), "Unsupported - only 1d signal/feature supported"
    assert len(feature_ar) <= len(
        signal
    ), "Unsupported - signal should be at least as long as the feature"
    padding_sz = len(feature_ar) - 1
    padded_signal = np.pad(
        signal_ar, (padding_sz, padding_sz), "constant", constant_values=0
    )
    lags = lags if lags is not None else range(-padding_sz, len(signal_ar), 1)
    if np.max(lags) >= len(signal_ar):
        raise ValueError("max positive lag must be shorter than the signal")
    if np.min(lags) <= -len(feature_ar):
        raise ValueError("max negative lag can't be longer than the feature")
    assert np.max(lags) < len(signal_ar), ""
    lagged_patterns = np.asarray(
        [
            np.pad(
                feature_ar,
                (padding_sz + lag, len(signal_ar) - lag - 1),
                "constant",
                constant_values=0,
            )
            for lag in lags
        ]
    )
    return XCorr(
        cross_correlation=np.corrcoef(padded_signal, lagged_patterns)[0, 1:],
        lags=np.asarray(lags),
    )

示例：

signal = [0, 0, 1, 0.5, 1, 0, 0, 1]
feature = [1, 0, 0, 1]
xcorr = cross_correlation(signal, feature)
assert xcorr.lags[xcorr.cross_correlation.argmax()] == 4

For a full mode, would it make sense to compute corrcoef directly on the lagged signal/feature? Code

from dataclasses import dataclass
from typing import Any, Optional, Sequence

import numpy as np

ArrayLike = Any


@dataclass
class XCorr:
    cross_correlation: np.ndarray
    lags: np.ndarray


def cross_correlation(
    signal: ArrayLike, feature: ArrayLike, lags: Optional[Sequence[int]] = None
) -> XCorr:
    """
    Computes normalized cross correlation between the `signal` and the `feature`.
    Current implementation assumes the `feature` can't be longer than the `signal`.
    You can optionally provide specific lags, if not provided `signal` is padded
    with the length of the `feature` - 1, and the `feature` is slid/padded (creating lags)
    with 0 padding to match the length of the new signal. Pearson product-moment
    correlation coefficients is computed for each lag.

    See: https://en.wikipedia.org/wiki/Cross-correlation

    :param signal: observed signal
    :param feature: feature you are looking for
    :param lags: optional lags, if not provided equals to (-len(feature), len(signal))
    """
    signal_ar = np.asarray(signal)
    feature_ar = np.asarray(feature)
    if np.count_nonzero(feature_ar) == 0:
        raise ValueError("Unsupported - feature contains only zeros")
    assert (
        signal_ar.ndim == feature_ar.ndim == 1
    ), "Unsupported - only 1d signal/feature supported"
    assert len(feature_ar) <= len(
        signal
    ), "Unsupported - signal should be at least as long as the feature"
    padding_sz = len(feature_ar) - 1
    padded_signal = np.pad(
        signal_ar, (padding_sz, padding_sz), "constant", constant_values=0
    )
    lags = lags if lags is not None else range(-padding_sz, len(signal_ar), 1)
    if np.max(lags) >= len(signal_ar):
        raise ValueError("max positive lag must be shorter than the signal")
    if np.min(lags) <= -len(feature_ar):
        raise ValueError("max negative lag can't be longer than the feature")
    assert np.max(lags) < len(signal_ar), ""
    lagged_patterns = np.asarray(
        [
            np.pad(
                feature_ar,
                (padding_sz + lag, len(signal_ar) - lag - 1),
                "constant",
                constant_values=0,
            )
            for lag in lags
        ]
    )
    return XCorr(
        cross_correlation=np.corrcoef(padded_signal, lagged_patterns)[0, 1:],
        lags=np.asarray(lags),
    )

Example:

signal = [0, 0, 1, 0.5, 1, 0, 0, 1]
feature = [1, 0, 0, 1]
xcorr = cross_correlation(signal, feature)
assert xcorr.lags[xcorr.cross_correlation.argmax()] == 4

回复收藏 0 原文

~没有更多了~