选择唯一映射到 SNP 的基因

发布于 2025-01-10 20:48:29 字数 985 浏览 4 评论 0原文

我有一个 vcf,想要选择 100 个基因,并为每个基因选择一个 SNP? 从技术上讲,如果我们考虑一个基因,它就有许多 rsid 映射到它。对于我的分析,我需要选择 100 个基因,并且对于每个基因仅选择一个 SNP,并忽略其他基因,并得到一个最终的 vcf 文件,其中仅唯一映射到 1 个 rsid 的 100 个基因。 有没有什么包可以做到这一点?

22 16050075 rs587697622 AG 100.0 通过
AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_A F=0.001;AA=.|||;VT=SNP;ANN=G|基因间区域|修饰符|CHR_START-LA16c-4G1.3|CHR_START-ENSG00000233866|基因间区域|CHR_START-ENSG00000233866|||n.16050 075A>G| ||||| GT

这就是我正在尝试做的事情: 库(vcfR)

# Import VCF
my.vcf <- read.vcfR('my.vcf.gz')
# Combine CHROM thru FILTER cols + INFO cols
my.vcf.df <- cbind(as.data.frame(getFIX(my.vcf)), INFO2df(my.vcf))
data_new <- my.vcf.df[!duplicated(my.vcf.df[ ,"ANN"]),]

这是我的数据集的样子:

ANN 列有几个 ENSG00000233866 值。我想根据 ANN 值(即基因 ID)过滤 df。因此,对于 ENSG00000233866,我只想选择一个 rsid 或 rs 值并忽略其他值。我想对其他基因进行类似的操作。

I have a vcf and want to select 100 genes, and for each gene, to select one SNP?
Technically if we consider one gene it has many rsid mapped to it. For my analysis I need to select 100 genes and for each gene select only one SNP for it and ignore others and have a final vcf file with only 100 genes uniquely mapped to 1 rsid only.
Is there any package which can do this?

22 16050075 rs587697622 A G 100.0 PASS
AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_A
F=0.001;AA=.|||;VT=SNP;ANN=G|intergenic_region|MODIFIER|CHR_START-LA16c-4G1.3|CHR_START-ENSG00000233866|intergenic_region|CHR_START-ENSG00000233866|||n.16050 075A>G|||||| GT

This is what I am trying to do:
library(vcfR)

# Import VCF
my.vcf <- read.vcfR('my.vcf.gz')
# Combine CHROM thru FILTER cols + INFO cols
my.vcf.df <- cbind(as.data.frame(getFIX(my.vcf)), INFO2df(my.vcf))
data_new <- my.vcf.df[!duplicated(my.vcf.df[ ,"ANN"]),]

This is how my dataset looks like:
enter image description here

As you can see the ANN column has several ENSG00000233866 values. I want to filter the df based on ANN values that is gene ID. So for ENSG00000233866 i just want to select only one rsid or rs values and ignore others. I want to do this similarly for other genes.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

林空鹿饮溪 2025-01-17 20:48:29

考虑 geneANN 中管道分隔列的 str.split,它大约是第四个管道。然后通过使用 ave 进行基因分组来运行顺序计数。然后 subset 组编号为 1。下面请不要在 R 4.1.0+ 中使用新管道 |>

data_new <- within(
  data_new, {
    GENE <- gsub(
       "CHR_START-", "", 
       sapply(strsplit(ANN, "|"), "[", 4)
    )
    GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
  }
) |> subset(
  GENE %in% unique(GENE)[1:100] &  # SUBSET FOR FIRST 100 GENES
  GENE_GROUP_SEQ == 1              # SUBSET FOR FIRST OF EACH GENE GROUP
) |> `row.names<-`(NULL)           # RESET ROW NAMES

对于低于 4.1.0 的 R 版本,请中断将管道输出到单独的调用:

data_new <- within(
  data_new, {
    GENE <- gsub(
       "CHR_START-", "", 
       sapply(strsplit(ANN, "|"), "[", 4)
    )
    GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
  }
)

data_sub <- subset(
  data_new,
  GENE %in% unique(GENE)[1:100] &           # SUBSET FOR FIRST 100 GENES
  GENE_GROUP_SEQ == 1                       # SUBSET FOR FIRST OF EACH GENE GROUP
)

dat_sub <- `row.names<-`(data_sub, NULL)    # RESET ROW NAMES

Consider str.split of the pipe-delimited column in ANN for the gene which is about the fourth pipe. Then run a sequential count by the gene grouping with ave. Then subset group number for the 1. Please not below uses the new pipe, |> in R 4.1.0+:

data_new <- within(
  data_new, {
    GENE <- gsub(
       "CHR_START-", "", 
       sapply(strsplit(ANN, "|"), "[", 4)
    )
    GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
  }
) |> subset(
  GENE %in% unique(GENE)[1:100] &  # SUBSET FOR FIRST 100 GENES
  GENE_GROUP_SEQ == 1              # SUBSET FOR FIRST OF EACH GENE GROUP
) |> `row.names<-`(NULL)           # RESET ROW NAMES

For R versions less that 4.1.0, break out the pipe to separate calls:

data_new <- within(
  data_new, {
    GENE <- gsub(
       "CHR_START-", "", 
       sapply(strsplit(ANN, "|"), "[", 4)
    )
    GENE_GROUP_SEQ <- ave(1:nrow(data_new), GENE, FUN=seq_along)
  }
)

data_sub <- subset(
  data_new,
  GENE %in% unique(GENE)[1:100] &           # SUBSET FOR FIRST 100 GENES
  GENE_GROUP_SEQ == 1                       # SUBSET FOR FIRST OF EACH GENE GROUP
)

dat_sub <- `row.names<-`(data_sub, NULL)    # RESET ROW NAMES
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文