R 中的典型相关分析

发布于 2024-11-04 14:03:59 字数 1024 浏览 6 评论 0原文

我正在使用 R (和包 CCA)并尝试使用两个变量集(分别存储为两个矩阵 Y 和 X 的物种丰度和食物丰度)执行正则化典型相关分析,其中单位数量 (N=15)小于矩阵中变量的数量,大于400(其中大部分是潜在的“解释”变量,只有12-13个“响应”变量)。冈萨雷斯等人。 (2008 年,http://www.jstatsoft.org/v23/i12/paper)请注意,该软件包“包括 CCA 的正则化版本,用于处理变量多于单位的数据集”,这当然是我只有 15 个“单位”的情况。因此,我尝试使用 CCA 包执行正则化规范相关分析,以便查看变量集中的关系。我一直在关注 Gonzalez 等人 (2008) 在他们的论文中经历的过程。但是,我收到一条错误消息 Error in chol.default(Bmat) : theleading secondary of order 12 is not Positive Defined 并且我不知道这意味着什么或该怎么办。这是代码,任何有关该主题的想法或知识将不胜感激。

library(CCA)
correl <- matcor(X, Y)
img.matcor(correl, type = 2)
res.regul <- estim.regul(X, Y, plt = TRUE,
    grid1 = seq(0.0001, 0.2, l=51),
    grid2 = seq(0, 0.2, l=51))

Error in chol.default(Bmat) : the leading minor of order 12 is not positive definite

(注意:当您使用 CCA 的示例数据 nutrimouse 时,estim.regul() 需要很长时间(约 30-40 分钟)才能完成)。

有什么建议吗?有谁知道如何处理此错误?是因为我的某些专栏中有 NA 吗?是否是由于列中有太多 0?预先感谢您对此综合统计数据和信息提供的任何帮助。 R新手。

I'm using R (and package CCA) and trying to perform regularized canonical correlation analysis with two variable sets (species abundances and food abundances stored as the two matrices Y and X, respectively) in which the number of units (N=15) is less than the number of variables in the matrices, which is >400 (most of them being potential "explanatory" variables, with only 12-13 "response" variables). Gonzalez et al. (2008, http://www.jstatsoft.org/v23/i12/paper) note that the package "includes a regularized version of CCA to deal with data sets with more variables than units", which is certainly what I have with only 15 "units." Thus, I'm trying to perform regularized canonical correlation analysis using the CCA package in order to look at the relationships in my variable sets. I have been following the process Gonzalez et al (2008) go through in their paper. However, I get to an error message Error in chol.default(Bmat) : the leading minor of order 12 is not positive definite and I do not know what it means or what to do about it. Here is the code, and any ideas or knowledge on the subject would be appreciated.

library(CCA)
correl <- matcor(X, Y)
img.matcor(correl, type = 2)
res.regul <- estim.regul(X, Y, plt = TRUE,
    grid1 = seq(0.0001, 0.2, l=51),
    grid2 = seq(0, 0.2, l=51))

Error in chol.default(Bmat) : the leading minor of order 12 is not positive definite

(Note: estim.regul() takes a long time (~30-40 min) to complete when you use the sample data nutrimouse from CCA).

Any advice? Does anyone know what to do about this error? Is it because some of my columns have an NA in them? Could it be due to columns with too many 0's? Thanks in advance for any help you can offer to this combined stats & R novice.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

东北女汉子 2024-11-11 14:03:59

背景

典型相关分析 (CCA) 是一种探索性数据分析 (EDA) 技术,可估计在同一实验单元上收集的两组变量之间的相关关系。通常,用户将有两个数据矩阵,X 和 Y,其中行代表实验单位,nrow(X) == nrow(Y)。

在 R 中,基础包提供了函数 cancor() 来启用 CCA。这仅限于观察数量大于变量(特征)数量的情况,nrow(X) > 。 ncol(X)。

R 包 CCA 是提供扩展 CCA 功能的几个包之一。 CCA 包提供了一组围绕 cancor() 的包装函数,可以考虑特征计数超过实验单元计数的情况,即 ncol(X) > ncol(X) > 。 n行(X)。 Gonzalez 等人 (2008) CCA:扩展规范相关分析的 R 包,描述了一些细节上的工作。 CCA 包 版本 1.2(2014 年 7 月 2 日发布)在撰写本文时是最新的。

也许还值得一提的是,早期答案中提到的包 kinshipaccuracy 不再托管在 CRAN 上。

诊断

在跳到其他软件包或对您的(大概来之不易!)数据应用未知方法之前,尝试诊断数据问题可能是有益的。

理想情况下,传递到此处提到的任何 CCA 例程的矩阵在数字上应该是完整的(没有缺失值)。理想情况下,传递到此处提到的任何 CCA 例程的矩阵在数字上应该是完整的(没有缺失值)。该过程估计的规范相关数将等于 X 和 Y 的最小列秩,即 <= min(ncol(X), ncol(Y))。理想情况下,每个矩阵的列将是线性独立的(不是其他矩阵的线性组合)。

示例:

library(CCA)
data(nutrimouse)
X <- as.matrix(nutrimouse$gene[,1:10])
Y <- as.matrix(nutrimouse$lipid)

cc(X,Y) ## works

X[,1] <- 2 * X[,9] ## column 9 no longer provides unique information

cc(X,Y)

Error in chol.default(Bmat) :
  the leading minor of order 9 is not positive definite

这是原始帖子中看到的症状。一个简单的测试是尝试在没有该列的情况下进行拟合。

cc(X[,-9],Y) ## works

因此,虽然从分析中删除数据的意义上来说这可能会令人沮丧,但该数据无论如何都不会提供信息。您的分析效果取决于您提供的数据。

此外,有时可以通过对一个(或两个)输入矩阵使用标准化(参见 ?scale)变量来解决数值不稳定问题:

X <- scale(X)

说到这里,也许值得指出的是,规范化 CCA 本质上是一个两步过程。进行交叉验证来估计正则化参数(使用 estim.regul()),然后使用这些参数执行正则化 CCA(使用 rcc()) 。

论文中提供的示例(原始帖子中逐字使用的参数)

res.regul <- estim.regul(X, Y, plt = TRUE,
                               grid1 = seq(0.0001, 0.2, l=51),
                               grid2 = seq(0, 0.2, l=51))

要求在 51*51 = 2601 单元格网格上进行交叉验证。虽然这会为论文生成漂亮的图形,但对于您自己的数据进行初始测试,这些设置并不明智。正如作者所说,“计算要求不是很高。在 51 x 51 网格的“当前使用”计算机上,它持续了不到一个半小时”。自 2008 年以来,情况有所改善,但 生成的默认 5 x 5 网格

estim.regul(X,Y,plt=TRUE) 

是用于探索目的的绝佳选择。如果你要犯错误,你最好尽快犯。

Background

Canonical Correlation Analysis (CCA) is an exploratory data analysis (EDA) technique providing estimates of the correlation relationship between two sets of variables collected on the same experimental units. Typically, users will have two matrices of data, X and Y, where the rows represent the experimental units, nrow(X) == nrow(Y).

In R, the base package provides the function cancor() to enable CCA. This is limited to cases where the number of observations is greater than the number of variables (features), nrow(X) > ncol(X).

The R package CCA is one of several which provide extended CCA functionality. Package CCA offers a set of wrapper functions around cancor() which enable consideration of cases where the feature count exceeds the count of experimental units, ncol(X) > nrow(X). Gonzalez et al (2008) CCA: An R Package to Extend Canonical Correlation Analysis, describes the workings in some detail. Version 1.2 of package CCA (published 2014-07-02) is current at the time of writing.

Probably also worth mentioning that packages kinship and accuracy mentioned in an earlier answer are no longer hosted on CRAN.

Diagnosis

Before jumping to other packages or applying unknown methods to your (presumably hard-won!) data, it's arguably beneficial to try and diagnose what the data problem may be.

The matrices passed to any of the CCA routines mentioned here should ideally be numerically complete (no missing values). The matrices passed to any of the CCA routines mentioned here should ideally be numerically complete (no missing values). The number of canonical correlates estimated by the procedure will be equal to the minimum column rank of X and Y, that is <= min(ncol(X), ncol(Y)). Ideally, the columns of each matrix will be linearly independent (not linear combinations of others).

Example:

library(CCA)
data(nutrimouse)
X <- as.matrix(nutrimouse$gene[,1:10])
Y <- as.matrix(nutrimouse$lipid)

cc(X,Y) ## works

X[,1] <- 2 * X[,9] ## column 9 no longer provides unique information

cc(X,Y)

Error in chol.default(Bmat) :
  the leading minor of order 9 is not positive definite

This is the symptom seen in the original post. One simple test is to try fitting without that column

cc(X[,-9],Y) ## works

So, while this may be frustrating in the sense that you are removing data from the analyis, that data is not providing information anyway. Your analyses can only ever be as good as the data you provide.

Also, sometimes numerical instability can be addressed by using standarized (see ?scale) variables for one (or both) of the input matrices:

X <- scale(X)

While we're here, it's perhaps worth pointing out that regularized CCA is essentially a two-step process. A cross-validation is undertaken to estimate regularization parameters (using estim.regul()), and those parameters are then used to perform the regularized CCA (with rcc()).

The example provided in the paper (arguments used verbatim in the original post)

res.regul <- estim.regul(X, Y, plt = TRUE,
                               grid1 = seq(0.0001, 0.2, l=51),
                               grid2 = seq(0, 0.2, l=51))

calls for cross-validation on a 51*51 = 2601 cell grid. While this produces a nice graphic for the paper, these are not sensible settings for initial tests on your own data. As the authors state, "The computation is not very demanding. It lasted less than one hour and half on a 'current use' computer for the 51 x 51 grid". Things have improved a little since 2008, but the default 5 x 5 grid produced by

estim.regul(X,Y,plt=TRUE) 

is an excellent choice for exploratory purposes. If you're going to make mistakes, you might as well make them as quickly as possible.

死开点丶别碍眼 2024-11-11 14:03:59

您的 X 方差-协方差矩阵不是正定的,因此在内部调用 fda::geigen 时会出现错误。

mixOmics 包中有一个类似的正则化 CCA 函数,但是我猜它会导致相同的错误消息,因为它基本上使用相同的方法(除了他们将 geigen 函数直接插入到 rcc 函数中)。我实际上不记得如何让它处理我的数据,解决相关问题(但一旦我再次找到它,我会查看我的旧代码:-)

一种解决方案是使用广义 Cholesky 分解。 亲属关系 (gchol; 注意,它返回一个下三角矩阵)或准确度 (sechol) 包。当然,这意味着修改函数内部的代码,但这并不是真正的问题,IMO。或者您可以尝试使用 corpcor 包。

作为替代方案,您可以考虑使用 RGCCA 包,该包为 PLS(路径建模)和具有 k 个块的 CCA 方法提供统一的接口。

Your X variance-covariance matrix is not positive-definite, hence the error when internally calling fda::geigen.

There's a similar function for regularized CCA in the mixOmics package, but I guess it will lead to the same error message because it basically uses the same approach (except that they plugged the geigen function directly into the rcc function). I can't actually remember how I get it to work with my data, for a related problem (but I'll look into my old code once I find it again :-)

One solution would be to use a generalized Cholesky decomposition. There is one in the kinship (gchol; be careful, it returns a lower triangular matrix) or accuracy (sechol) package. Of course, this means modifying the code inside the function, but it is not really a problem, IMO. Or you can try to make Var(X) PD with make.positive.definite from the corpcor package.

As an alternative, you might consider using the RGCCA package, which offers an unified interface for PLS (path modeling) and CCA methods with k blocks.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文