Using multiple cores to analyze GWAS data in R

Posted 2024-12-20 20:57:55


I am using R to analyze genome-wide association study data. I have about 500,000 potential predictor variables (single-nucleotide polymorphisms, or SNPs) and want to test the association between each of them and a continuous outcome (in this case low-density lipoprotein concentration in the blood).

I have already written a script that does this without problem. To briefly explain, I have a data object, called "Data". Each row corresponds to a particular patient in the study. There are columns for age, gender, body mass index (BMI), and blood LDL concentration. There are also half a million other columns with the SNP data.

I am currently using a for loop to run the linear model half a million times, as shown:

# Pre-allocate the results matrix: column 1 = p value, column 2 = effect estimate
results <- matrix(NA_real_, nrow = 500000, ncol = 2)

# Repeat loop half a million times
for (i in 1:500000) {

  # Select the appropriate SNP as a vector (assumes column i of Data holds the i-th SNP, coded numerically)
  SNP <- Data[[i]]

  # For each iteration, perform linear regression adjusted for age, gender, and BMI and save the result in an object called "GenoMod"
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)

  # For each model, save the p value and effect estimate for the SNP term in columns 1 and 2 of a matrix called "results"
  coefs <- summary(GenoMod)$coefficients
  results[i, 1] <- coefs["SNP", "Pr(>|t|)"]
  results[i, 2] <- coefs["SNP", "Estimate"]
}

All of that works fine. However, I would really like to speed up my analysis. I've therefore been experimenting with the multicore, doMC, and foreach packages.

My question is, could someone please help me adapt this code using the foreach scheme?

I am running the script on a Linux server that apparently has 16 cores available. I've tried experimenting with the foreach package, but the results have been worse: the analysis takes longer with foreach than without it.

For example, I've tried saving the linear model objects as shown:

library(doMC)
registerDoMC()
results <- foreach(i = 1:500000) %dopar% {
  # Select the i-th SNP as a vector (assumes column i of Data holds the i-th SNP)
  SNP <- Data[[i]]
  lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
}

This takes more than twice as long as using just a regular for loop. Any advice on how to do this better or more quickly would be appreciated! I understand that using the parallel version of lapply might be an option, but don't know how to do this either.

All the best,

Alex


花间憩 2024-12-27 20:57:55

To get you started: if you are on Linux, you can use the multicore (forking) approach built into the parallel package. Unlike the foreach package, which needs a parallel backend to be registered first, this approach needs no extra setup. Your code would run on 16 cores simply by doing:

library(parallel)

mylm <- function(i) {
  # Select the i-th SNP as a vector (assumes column i of Data holds the i-th SNP, coded numerically)
  SNP <- Data[[i]]
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  # Return the p value and the effect estimate for the SNP term
  coefs <- summary(GenoMod)$coefficients
  c(coefs["SNP", "Pr(>|t|)"], coefs["SNP", "Estimate"])
}

Out <- mclapply(1:500000, mylm, mc.cores = 16) # returns a list
Result <- do.call(rbind, Out) # bind the list into a two-column matrix

Here you write a function that returns a vector with the wanted quantities and apply it over the column indices. I couldn't test this since I don't have access to the data, but it should work.
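Since the real data is not available, here is a minimal sketch of the same mclapply pattern on simulated data, small enough to run in a few seconds. The data frames `covars` and `geno`, the helper `fit_snp`, and all simulated values are hypothetical stand-ins, and keeping the genotypes in a separate data frame is only a convenience for this sketch, not the layout described in the question. It checks that the parallel run gives the same numbers as a plain serial lapply.

```r
library(parallel)  # ships with R; mclapply forks, so this needs Linux or macOS

set.seed(1)
n_patients <- 200
n_snps <- 50

# Hypothetical covariates and outcome, named as in the question
covars <- data.frame(
  bloodLDLlevel = rnorm(n_patients, mean = 3, sd = 1),
  Age    = round(runif(n_patients, 30, 80)),
  Gender = factor(sample(c("F", "M"), n_patients, replace = TRUE)),
  BMI    = rnorm(n_patients, mean = 26, sd = 4)
)

# Simulated genotypes coded 0/1/2 (an assumption about the coding)
geno <- as.data.frame(
  matrix(rbinom(n_patients * n_snps, size = 2, prob = 0.3), nrow = n_patients)
)

# Fit one adjusted model and return only the two numbers of interest
fit_snp <- function(i) {
  SNP <- geno[[i]]  # [[ ]] extracts the column as a plain vector
  fit <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = covars)
  coefs <- summary(fit)$coefficients
  c(p.value = coefs["SNP", "Pr(>|t|)"], estimate = coefs["SNP", "Estimate"])
}

# Serial reference and forked parallel run over the same indices
serial <- do.call(rbind, lapply(seq_len(n_snps), fit_snp))
parallel_res <- do.call(rbind, mclapply(seq_len(n_snps), fit_snp, mc.cores = 2))

# The two result matrices should agree
stopifnot(isTRUE(all.equal(serial, parallel_res)))
```

Returning two numbers per SNP instead of the whole lm object keeps the amount of data sent back from each worker small, which is probably part of why the foreach attempt in the question was slow. With its default mc.preschedule = TRUE, mclapply also splits the indices into one batch per core up front, so the per-task overhead stays low even for half a million short fits.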
