Approximating covariance for arrays of different sizes

Posted on 2024-12-25 16:40:39

Are there any common tools in NumPy/SciPy for computing a correlation measure that works even when the input variables are differently sized? In the standard formulation of covariance and correlation, one is required to have the same number of observations for each different variable under test. Typically, you must pass a matrix where each row is a different variable and each column represents a distinct observation.

In my case, I have 9 different variables, but for each variable the number of observations is not constant. Some variables have more observations than others. I know that there are fields like sensor fusion which study problems like this, so what standard tools are out there for computing relational statistics on data series of differing lengths (preferably in Python)?

Comments (4)

(り薆情海 2025-01-01 16:40:39

I would examine this page:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html

UPDATE:

Suppose each row of your data matrix corresponds to a particular random variable, and the entries in the row are observations. What you have is a simple missing data problem, as long as you have a correspondence between the observations. That is to say, if one of your rows has only 10 entries, do these 10 entries (i.e., trials) correspond to 10 samples of the random variable in the first row? E.g., suppose you have two temperature sensors that take samples at the same times, but one is faulty and sometimes misses a sample. Then you should treat the trials where the faulty sensor failed to produce a reading as "missing data." In your case, it's as simple as creating two NumPy vectors of the same length, putting zeros (or any value, really) in the positions of the shorter vector that correspond to the missing trials, and creating a mask matrix that indicates where the missing values are in your data matrix.

Supplying such a matrix to the function linked to above should allow you to perform exactly the computation you want.
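As a rough sketch of what that could look like for the two-sensor example (the readings, the variable names, and the positions of the missed trials are all made up for illustration):

```python
import numpy as np

# Hypothetical readings from two sensors sampled at the same six times.
sensor_a = np.array([20.1, 20.4, 20.9, 21.3, 21.8, 22.0])
# Sensor B missed trials 1 and 4; the 0.0 placeholders are hidden by the mask.
sensor_b = np.array([19.8, 0.0, 20.7, 21.0, 0.0, 21.9])
missing_b = np.array([False, True, False, False, True, False])

# Rows are variables, columns are trials; True in the mask means "missing".
data = np.ma.masked_array(
    np.vstack([sensor_a, sensor_b]),
    mask=np.vstack([np.zeros(6, dtype=bool), missing_b]),
)

cov = np.ma.cov(data)  # 2 x 2 covariance matrix that skips masked entries
print(cov)
```

The placeholder value should not matter, since masked entries are excluded from the sums.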

滥情稳全场 2025-01-01 16:40:39

"The issue is that each variable corresponds to the response on a survey, and not every survey taker answered every question. Thus, I want some measure of how an answer to question 2, say, affects likelihood of answers to question 8, for example."

This is the missing data problem. I think what's confusing people is that you keep referring to your samples as having different lengths. I think you might be visualizing them like this:

sample 1:

question number: [1,2,3,4,5]
response       : [1,0,1,1,0]

sample 2:

question number: [2,4,5]
response       : [1,1,0]

when sample 2 should be more like this:

question number: [  1,2,  3,4,5]
response       : [NaN,1,NaN,1,0]

It's the question number, not the number of questions answered that's important. Without question-to-question correspondence it's impossible to calculate anything like a covariance matrix.
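A minimal sketch of that realignment, assuming each respondent's answers arrive as a pair of lists (question numbers, responses); the `samples` layout and variable names are hypothetical:

```python
import numpy as np

n_questions = 5
# Each hypothetical respondent: (question numbers answered, responses given).
samples = [
    ([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]),
    ([2, 4, 5], [1, 1, 0]),
]

# Rows are respondents, columns are questions; NaN marks "not answered".
responses = np.full((len(samples), n_questions), np.nan)
for row, (questions, answers) in enumerate(samples):
    # Question numbers are 1-based, column indices are 0-based.
    responses[row, np.array(questions) - 1] = answers

# Convert the NaN entries into a mask so numpy.ma functions can skip them.
masked = np.ma.masked_invalid(responses)
print(masked)
```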

Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each element being summed depends on only two values.

So it calculates the ones it can. Then, when it comes to the step of dividing by n, it divides by the number of values that were actually calculated (for that particular covariance-matrix element) instead of the total number of samples.
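To see the per-element counts, here is a small sketch on made-up survey data (six respondents, three questions). The `pair_counts` matrix is computed by hand for illustration and is not something numpy.ma.cov returns:

```python
import numpy as np

# Hypothetical survey: rows are respondents, columns are questions.
responses = np.array([
    [1.0,    0.0,    1.0],
    [np.nan, 1.0,    1.0],
    [0.0,    1.0,    np.nan],
    [1.0,    np.nan, 0.0],
    [0.0,    0.0,    1.0],
    [1.0,    1.0,    0.0],
])
masked = np.ma.masked_invalid(responses)

# Number of respondents who answered both question i and question j.
valid = (~np.ma.getmaskarray(masked)).astype(int)
pair_counts = valid.T @ valid

# rowvar=False because here the variables (questions) are the columns.
cov = np.ma.cov(masked, rowvar=False)

print(pair_counts)
print(cov)
```

Each question here has five answers, but every pair of questions shares only four respondents, so the off-diagonal elements are based on fewer values than the diagonal ones.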

疾风者 2025-01-01 16:40:39

From a purely mathematical point of view, I believe they have to be the same size. To make them the same size, you can apply some concepts related to the missing data problem. I guess I am saying it is not strictly a covariance anymore if the vectors aren't the same size. Whatever tool you use will just make up some points in some smart way to make the vectors of equal length.

蓬勃野心 2025-01-01 16:40:39

Here's my take on the question. Strictly speaking, the formula for computing the covariance of 2 random variables Cov(X,Y) = E[XY] - E[X]E[Y] does not tell you anything about sample sizes or how X and Y should form a random vector (i.e. x_i's and y_i's do not explicitly come in pairs).

E[X] and E[Y] are computed the usual way, even when the numbers of observations for X and Y do not match. As for E[XY], in the case of separately sampled X and Y, you can take it as meaning "the mean of all possible combinations of x_i * y_j", in other words:

# NumPy code :
import numpy as np

X = ... # your first data sample
Y = ... # your second data sample

E_XY = np.outer(X, Y).ravel().mean()
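
One thing worth checking before relying on this: the mean over all x_i * y_j combinations factors into mean(X) * mean(Y), so plugging this E_XY into Cov(X,Y) = E[XY] - E[X]E[Y] gives zero (up to rounding) for any inputs. That matches treating the separately sampled X and Y as independent. A quick sketch with made-up random samples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=7)  # hypothetical sample of X
Y = rng.normal(size=4)  # hypothetical, separately drawn sample of Y

# Mean of all x_i * y_j combinations, as in the snippet above.
E_XY = np.outer(X, Y).mean()

# The all-pairs mean equals X.mean() * Y.mean(), so this difference vanishes.
cov_xy = E_XY - X.mean() * Y.mean()
print(cov_xy)
```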