Approximating covariance for arrays of different sizes
Are there any common tools in NumPy/SciPy for computing a correlation measure that works even when the input variables are differently sized? In the standard formulation of covariance and correlation, one is required to have the same number of observations for each different variable under test. Typically, you must pass a matrix where each row is a different variable and each column represents a distinct observation.
In my case, I have 9 different variables, but for each variable the number of observations is not constant. Some variables have more observations than others. I know that there are fields like sensor fusion which study problems like this, so what standard tools are out there for computing relational statistics on data series of differing lengths (preferably in Python)?
Comments (4)
I would examine this page:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html
UPDATE:
Suppose each row of your data matrix corresponds to a particular random variable, and the entries in the row are observations. What you have is a simple missing data problem, as long as you have a correspondence between the observations. That is to say, if one of your rows has only 10 entries, do these 10 entries (i.e., trials) correspond to 10 of the samples of the random variable in the first row? E.g., suppose you have two temperature sensors and they take samples at the same times, but one is faulty and sometimes misses a sample. Then you should treat the trials where the faulty sensor failed to generate a reading as "missing data." In your case, it's as simple as creating two vectors in NumPy that are of the same length, putting zeros (or any value, really) into the shorter vector at the positions that correspond to the missing trials, and creating a mask matrix that indicates where the missing values are in your data matrix.
Supplying such a matrix to the function linked to above should allow you to perform exactly the computation you want.
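A minimal sketch of that recipe, assuming two hypothetical sensors (sensor_a, sensor_b) sampled at the same six times, with the second one missing two readings. The readings, the placeholder zeros, and the variable names are made up for illustration; numpy.ma.array and numpy.ma.cov are the only library calls used.

```python
import numpy as np

# Hypothetical readings; sensor_b missed trials 1 and 4, so those slots
# hold a placeholder (0.0) that the mask tells NumPy to ignore.
sensor_a = np.array([20.1, 20.4, 20.9, 21.3, 21.8, 22.0])
sensor_b = np.array([19.8, 0.0, 20.7, 21.0, 0.0, 21.9])

# True marks a missing observation; one row per variable, one column per trial.
mask = np.array([
    [False, False, False, False, False, False],
    [False, True, False, False, True, False],
])

data = np.ma.array(np.vstack([sensor_a, sensor_b]), mask=mask)

# Rows are variables by default (rowvar=True), as with numpy.cov.
cov = np.ma.cov(data)
print(cov)
```

The placeholder values never enter the result, so any filler works as long as the mask is correct.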
This is the missing data problem. I think what's confusing people is that you keep referring to your samples as having different lengths. I think you might be visualizing them like this:
sample 1:
sample 2:
when sample 2 should be more like this:
It's the question number, not the number of questions answered that's important. Without question-to-question correspondence it's impossible to calculate anything like a covariance matrix.
Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each of the elements being summed depends on only two values. So it calculates the ones it can. Then, when it comes to the step of dividing by n, it divides by the number of values that were calculated (for that particular covariance-matrix element) instead of the total number of samples.
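The per-element normalization described here can be sketched by hand. This is an illustration of the idea under assumed data, not NumPy's actual implementation; the arrays values and missing are hypothetical, and the sketch uses the unbiased form (count minus 1) rather than a plain division by n.

```python
import numpy as np

# Hypothetical data: two variables over six trials. True in `missing` marks
# a trial with no reading (the stored placeholder value is ignored).
values = np.array([
    [20.1, 20.4, 20.9, 21.3, 21.8, 22.0],
    [19.8, 0.0, 20.7, 21.0, 0.0, 21.9],
])
missing = np.array([
    [False, False, False, False, False, False],
    [False, True, False, False, True, False],
])
present = (~missing).astype(float)

# Center each variable on the mean of its own observed values.
row_means = (values * present).sum(axis=1) / present.sum(axis=1)
centered = (values - row_means[:, None]) * present  # zero where missing

# Entry (i, j) sums products only over trials where both i and j are present.
sums = centered @ centered.T

# Entry (i, j) counts the trials where both i and j are present.
pair_counts = present @ present.T

# Normalize each element by its own pair count (minus 1, unbiased form).
cov = sums / (pair_counts - 1)
print(pair_counts)
print(cov)
```

Each matrix element gets its own denominator, which is why variables with different numbers of valid observations can share one covariance matrix.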
From a purely mathematical point of view, I believe the vectors have to be the same length. To make them the same, you can apply some concepts related to the missing data problem. I guess I am saying it is not strictly a covariance anymore if the vectors aren't the same size. Whatever tool you use will just make up some points in some smart way to make the vectors of equal length.
Here's my take on the question. Strictly speaking, the formula for computing the covariance of 2 random variables
Cov(X,Y) = E[XY] - E[X]E[Y]
does not tell you anything about sample sizes or how X and Y should form a random vector (i.e. the x_i's and y_i's do not explicitly come in pairs). E[X] and E[Y] are computed the usual way, even though the numbers of observations for X and Y do not match. As for E[XY], in the case of separately sampled X and Y, you can take it as meaning "the mean of all possible combinations of x_i * y_j", in other words:
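The expression that followed is missing from this copy of the answer. Read literally, "the mean of all possible combinations of x_i * y_j" can be sketched as below; the sample values are hypothetical. Note that an average over all pairs factors exactly into mean(x) * mean(y), so under this reading the covariance estimate is always zero up to rounding. In effect it treats separately sampled X and Y as independent and says nothing about how they vary together.

```python
import numpy as np

# Hypothetical, separately sampled observations of different lengths.
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([3.0, 5.0, 8.0])

# E[XY] taken as the mean over every combination x_i * y_j.
e_xy = np.outer(x, y).mean()

# Cov(X, Y) = E[XY] - E[X]E[Y] under that reading.
cov_xy = e_xy - x.mean() * y.mean()
print(e_xy, cov_xy)
```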