Clustering short, homogeneous strings (DNA) according to common sub-patterns and extracting consensus of classes
Task:
Cluster a large pool of short DNA fragments into classes that share common sub-sequence patterns, and find the consensus sequence of each class.
- Pool: ca. 300 sequence fragments
- 8 - 20 letters per fragment
- 4 possible letters: a,g,t,c
- each fragment is structured in three regions:
- 5 generic letters
- 8 or more positions of g's and c's
- 5 generic letters
(As a regex, that would be [gcta]{5}[gc]{8,}[gcta]{5})
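As a rough illustration of the layout above, the following sketch splits a fragment into its three regions using the regex from the question. The group names (`left`, `core`, `right`) and the function name `split_fragment` are made up for this example.

```python
import re

# Regex from the question; the named groups are illustrative labels only.
FRAGMENT = re.compile(
    r"(?P<left>[gcta]{5})(?P<core>[gc]{8,})(?P<right>[gcta]{5})"
)

def split_fragment(seq):
    """Return (left, core, right) if seq has the three-region layout, else None."""
    m = FRAGMENT.fullmatch(seq.lower())
    if m is None:
        return None
    return m.group("left"), m.group("core"), m.group("right")

# Example with a hypothetical 18-letter fragment.
print(split_fragment("atgca" + "gcgcgcgc" + "ttaca"))
```

Because the two flanks have a fixed length of 5, a full match leaves no ambiguity about where region 2 starts and ends.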
Plan:
Perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, along with their consensus sequences.
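For reference, once the members of a class are aligned to the same length, a simple consensus can be taken column by column. This is only a minimal majority-vote sketch and not what ClustalW2 itself computes; the function name `consensus` and the toy input are assumptions for illustration.

```python
from collections import Counter

def consensus(aligned):
    """Majority-vote consensus of equal-length, already aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # For each column, keep the most frequent letter
    # (on a tie, the letter seen first in that column wins).
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )

# Toy class of three aligned region-2 snippets (made-up data).
print(consensus(["gcgc", "gcgg", "gccc"]))
```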
Questions:
- Are my fragments too short, and would it help to increase their size?
- Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
- Which alternative methods or tools can you suggest for this task?
Best regards,
Simon
Comments (2)
Yes, 300 is FAR TOO FEW, considering that this is the human genome and you're essentially just looking for particular 8-mers. There are 4^8 = 65,536 possible 8-mers and 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). Of those 8-mers, 2^8 = 256 consist only of G and C, so you'd expect to find G/C-only 8-mers about 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably much more, since the genome is full of CpG islands). Why only choose 300?
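The estimate above can be reproduced in a few lines. This sketch assumes, as the estimate does, that all four bases are equally likely at every position; the function name `expected_gc_kmer_hits` is invented for the example.

```python
def expected_gc_kmer_hits(genome_bases, k):
    """Expected number of G/C-only k-mers under a uniform base model."""
    all_kmers = 4 ** k       # every possible k-mer over a, c, g, t
    gc_only_kmers = 2 ** k   # k-mers built from g and c only
    return genome_bases * gc_only_kmers / all_kmers

# Genome size as used in the answer: 3,000,000,000 bases, k = 8.
print(f"{expected_gc_kmer_hits(3_000_000_000, 8):,.0f}")
```

The result is a little under 12 million, which matches the rounded figure in the answer.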
You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC, and extend until you hit the first base that is not G or C. Then take that sequence and its context and save them (in a DB). Rinse and repeat.
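A minimal sketch of that scan might look like the following. The minimum run length of 8 and the 5-letter context on each side are assumptions borrowed from the fragment layout in the question, and saving to a database is left out; `gc_runs` is a hypothetical name.

```python
def gc_runs(seq, min_len=8, flank=5):
    """Yield (start, left_context, gc_run, right_context) for each maximal G/C run.

    min_len and flank are assumed defaults taken from the question's layout.
    """
    seq = seq.lower()
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "gc":
            # Extend until the first base that is not g or c.
            j = i
            while j < n and seq[j] in "gc":
                j += 1
            if j - i >= min_len:
                yield i, seq[max(0, i - flank):i], seq[i:j], seq[j:j + flank]
            i = j
        else:
            i += 1

# Made-up sequence with one long G/C run and one short run that is skipped.
demo = "atatat" + "gcgcgcgcgg" + "tatata" + "gcg" + "at"
for hit in gc_runs(demo):
    print(hit)
```

For a whole chromosome the same loop would be run over the sequence as it is read from disk, with each hit written out instead of printed.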
For this project, Clustal may be overkill, but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering: slide an 8-letter window over each GC region and group the regions by the 8-mers they contain.
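One way to sketch this kind of simple clustering is to index every GC region by the 8-mers it contains. This is an assumed illustration rather than the answerer's own snippet; `index_by_kmer` and the sample regions are hypothetical.

```python
from collections import defaultdict

def index_by_kmer(gc_regions, k=8):
    """Map each k-mer to the set of G/C regions that contain it."""
    index = defaultdict(set)
    for region in gc_regions:
        # Slide a window of width k across the region.
        for i in range(len(region) - k + 1):
            index[region[i:i + k]].add(region)
    return index

# Made-up G/C regions; real input would come from the extraction step.
regions = ["gcgcgcgcg", "cgcgcgcgc", "gggggggg"]
for kmer, members in sorted(index_by_kmer(regions).items()):
    print(kmer, sorted(members))
```

Regions that share a key in the index share that 8-mer, so each key can serve as the seed of a candidate class.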
Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
Your region 2, with only two letters, may end up a bit too similar across fragments; increasing its length or variability (e.g. allowing more letters) could help.