选择唯一映射到 SNP 的基因
我有一个 vcf,想要选择 100 个基因,并为每个基因选择一个 SNP? 从技术上讲,如果我们考虑一个基因,它就有许多 rsid 映射到它。对于我的分析,我需要选择 100 个基因,并且对于每个基因仅选择一个 SNP,并忽略其他基因,并得到一个最终的 vcf 文件,其中仅唯一映射到 1 个 rsid 的 100 个基因。 有没有什么包可以做到这一点?
22 16050075 rs587697622 AG 100.0 通过
AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_A F=0.001;AA=.|||;VT=SNP;ANN=G|基因间区域|修饰符|CHR_START-LA16c-4G1.3|CHR_START-ENSG00000233866|基因间区域|CHR_START-ENSG00000233866|||n.16050 075A>G| ||||| GT
这就是我正在尝试做的事情: 库(vcfR)
# Import VCF
my.vcf <- read.vcfR('my.vcf.gz')
# Combine CHROM thru FILTER cols + INFO cols
my.vcf.df <- cbind(as.data.frame(getFIX(my.vcf)), INFO2df(my.vcf))
data_new <- my.vcf.df[!duplicated(my.vcf.df[ ,"ANN"]),]
ANN 列有几个 ENSG00000233866 值。我想根据 ANN 值(即基因 ID)过滤 df。因此,对于 ENSG00000233866,我只想选择一个 rsid 或 rs 值并忽略其他值。我想对其他基因进行类似的操作。
I have a vcf and want to select 100 genes, and for each gene, to select one SNP?
Technically if we consider one gene it has many rsid mapped to it. For my analysis I need to select 100 genes and for each gene select only one SNP for it and ignore others and have a final vcf file with only 100 genes uniquely mapped to 1 rsid only.
Is there any package which can do this?
22 16050075 rs587697622 A G 100.0 PASS
AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_A
F=0.001;AA=.|||;VT=SNP;ANN=G|intergenic_region|MODIFIER|CHR_START-LA16c-4G1.3|CHR_START-ENSG00000233866|intergenic_region|CHR_START-ENSG00000233866|||n.16050 075A>G|||||| GT
This is what I am trying to do:
library(vcfR)
# Import VCF
my.vcf <- read.vcfR('my.vcf.gz')
# Combine CHROM thru FILTER cols + INFO cols
my.vcf.df <- cbind(as.data.frame(getFIX(my.vcf)), INFO2df(my.vcf))
data_new <- my.vcf.df[!duplicated(my.vcf.df[ ,"ANN"]),]
This is how my dataset looks like:
As you can see the ANN column has several ENSG00000233866 values. I want to filter the df based on ANN values that is gene ID. So for ENSG00000233866 i just want to select only one rsid or rs values and ignore others. I want to do this similarly for other genes.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
考虑 gene 的
ANN
中管道分隔列的str.split
,它大约是第四个管道。然后通过使用ave
进行基因分组来运行顺序计数。然后subset
组编号为 1。下面请不要在 R 4.1.0+ 中使用新管道|>
:对于低于 4.1.0 的 R 版本,请中断将管道输出到单独的调用:
Consider
str.split
of the pipe-delimited column inANN
for the gene which is about the fourth pipe. Then run a sequential count by the gene grouping withave
. Thensubset
group number for the 1. Please not below uses the new pipe,|>
in R 4.1.0+:For R versions less that 4.1.0, break out the pipe to separate calls: