在 R 中使用 GEOquery 和 SAM/Siggenes

发布于 2024-12-04 09:02:24 字数 1165 浏览 2 评论 0原文

由于需要,我最近开始学习 R,到目前为止,我认为一切都很好。但我仍处于早期阶段。然而,我在 R 中面临着这一重大而紧迫的挑战,我将非常感谢您的帮助。我的编程技能显然是业余的,并且肯定会接受我能得到的任何帮助。如下:

  1. 创建要从 GEO 检索的数据集列表 (gdslist) 数据库使用 GEOquery 包
  2. 将 gdslist 项 (gdsid) 转换为表达式数据。那是, 可以配合我的分析的数据。为此,GDS2eSet 函数 工作做得很好。
  3. 要以这样的方式读入转换后的表达式数据 可以创建类/级别文件 (.cls)。 GDS3715 数据集 例如,有 3 个级别——胰岛素抵抗、胰岛素敏感和 糖尿病患者。有时,数据集就是这么简单。但 其他时候,例如在本例中,分析级别将为 6 目的,因为虽然表型上有 3 个水平,但它们 被分为治疗组和未治疗组。经常有一个 在这种情况下添加了“代理”栏。每个班级/级别都需要 分配一个数字(0,1,2...)。一般情况就这样了 .cls 文件的格式。
  4. 要运行 Siggenes/SAM 分析(也是 R 中的包),需要两个文件 每个数据集需要:一个表达式文件(从 2 个数据集转换而来的文件) 上面)和随附的集群文件(来自 3)。
  5. 为了能够以某种方式对 gdslist 项目运行此过程 循环并将我的数据存储在指定的目录中。

我目前只能进行第 2 步。我认为第 3 步是挑战的关键...

非常感谢您的期待。

到目前为止的脚本:

> gdslist = c('GDS3715','GDS3716','GDS3717'...)#up to perhaps 100 datasets
> analysisfunc = function(gdsid) {
    gdsdat = getGEO(gdsid,destdir=".")
    gdseset = GDS2eSet(gdsdat)
    pData(gdseset)$disease.state #Needed assignment, etc...Step 3 stuff ;Siggenes/SAM can perhaps be done here
    return(sprintf("Results from %s should be here",gdsid))
  }
> resultlist = sapply(gdslist,analysisfunc) #loop function 

I very recently started learning R as a result of a need and so far, so good, I think. But I'm still in the very early stages. I am however faced with this major urgent challenge in R that I will greatly appreciate some help with. My programming skills are quite obviously amateur and will most definitely accept any help I can get. Here goes:

  1. To create a list of datasets (gdslist) to be retrieved from the GEO
    database using the GEOquery package
  2. To convert the gdslist items (gdsid) to expression data. That is,
    data that can work with my analysis. For this, a GDS2eSet function
    does the work fine.
  3. To read in this converted expression data in a way that a
    class/levels file (.cls) can be created. The GDS3715 dataset for
    example, has 3 levels – insulin resistant, insulin sensitive and
    diabetic. Sometimes, the datasets are as straightforward as that. But
    other times, like in this case, the levels will be 6 for analysis
    purposes, because although there are phenotypically 3 levels, they
    have been divided into treated and untreated groups. There is often an
    added “agent” column in such cases. Each class/level needs to be
    assigned a numerical number (0,1,2...). That’s pretty much the general
    format of a .cls file.
  4. To run a Siggenes/SAM analysis (also a package in R), two files are
    needed for each dataset: an expression file (the converted file from 2
    above) and the accompanying cluster file (from 3).
  5. To be able to run this process for the gdslist items in a kind of
    loop and have my data stored in a specified directory.

I can currently only get to step 2. I think step 3 is the crux of the challenge...

Many thanks in anticipation.

Script so far:

> gdslist = c('GDS3715','GDS3716','GDS3717'...)#up to perhaps 100 datasets
> analysisfunc = function(gdsid) {
    gdsdat = getGEO(gdsid,destdir=".")
    gdseset = GDS2eSet(gdsdat)
    pData(gdseset)$disease.state #Needed assignment, etc...Step 3 stuff ;Siggenes/SAM can perhaps be done here
    return(sprintf("Results from %s should be here",gdsid))
  }
> resultlist = sapply(gdslist,analysisfunc) #loop function 

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

茶色山野 2024-12-11 09:02:24

这应该适用于所有 gds 数据集。

 GEOSAM.analysis <- function( gdsid, destdir = getwd() ) {
       require( 'GEOquery' )
       require( 'siggenes' )
       ## test if gdsid is gdsid
       if( length(grep('GDS', gdsid)) == 0 ){
        stop()
       }
       gdsdat = getGEO( gdsid, destdir = destdir )
       gdseset = GDS2eSet( gdsdat )
       gdseset.pData <- pData( gdseset )
       gds.factors <- names( gdseset.pData )
       gds.factors[gds.factors == 'sample'] <- NA
       gds.factors[gds.factors == 'description'] <- NA
       gds.factors <- gds.factors[!is.na( gds.factors )]
       cl.list <- sapply( gdseset.pData[gds.factors], as.character)
       cl.list <- factor( apply( cl.list, 1, function(x){ paste( x , collapse = '-' )} ) )
       if( length( levels ( cl.list ) ) == 2 ){
        levels( cl.list ) <- 0:length( levels( cl.list ) )
       } else {
        levels( cl.list ) <- 1:length( levels( cl.list ) )
       }
       sam.gds <- sam( gdseset, cl.list )
       results.file <- file.path( destdir, paste( gdsid, '.sam.gds.rdata', sep =''  ) )
       save( sam.gds, file = results.file )
       return( sprintf( "Results from %s are saved in '%s'. These can be loaded by 'load('%s')'.",gdsid, results.file, results.file ) )
  }

  gdslist = c('GDS3715', 'GDS3716', 'GDS3717')
  resultlist = sapply(gdslist, GEOSAM.analysis)  
  print(resultlist)

This should work for all the gds datasets.

 GEOSAM.analysis <- function( gdsid, destdir = getwd() ) {
       require( 'GEOquery' )
       require( 'siggenes' )
       ## test if gdsid is gdsid
       if( length(grep('GDS', gdsid)) == 0 ){
        stop()
       }
       gdsdat = getGEO( gdsid, destdir = destdir )
       gdseset = GDS2eSet( gdsdat )
       gdseset.pData <- pData( gdseset )
       gds.factors <- names( gdseset.pData )
       gds.factors[gds.factors == 'sample'] <- NA
       gds.factors[gds.factors == 'description'] <- NA
       gds.factors <- gds.factors[!is.na( gds.factors )]
       cl.list <- sapply( gdseset.pData[gds.factors], as.character)
       cl.list <- factor( apply( cl.list, 1, function(x){ paste( x , collapse = '-' )} ) )
       if( length( levels ( cl.list ) ) == 2 ){
        levels( cl.list ) <- 0:length( levels( cl.list ) )
       } else {
        levels( cl.list ) <- 1:length( levels( cl.list ) )
       }
       sam.gds <- sam( gdseset, cl.list )
       results.file <- file.path( destdir, paste( gdsid, '.sam.gds.rdata', sep =''  ) )
       save( sam.gds, file = results.file )
       return( sprintf( "Results from %s are saved in '%s'. These can be loaded by 'load('%s')'.",gdsid, results.file, results.file ) )
  }

  gdslist = c('GDS3715', 'GDS3716', 'GDS3717')
  resultlist = sapply(gdslist, GEOSAM.analysis)  
  print(resultlist)
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文