Algorithm for variability analysis
I work with a lot of histograms. In particular, these are histograms of basecalls along segments of the human genome.
Each point along the x-axis is a base position, holding one of the four nitrogenous bases (A, C, T, G) that compose DNA. The y-axis shows how many times that base was "called", that is, recognized by a sequencing machine. Sequencing the genome simply means determining the identity of each base along it.
Many of these histograms show roughly linear dropoffs from plateau-like regions down to 0 (or almost 0), which happens when the machines can't get sufficient read depth. When the count drops to zero, the sequencer can't determine the identity of the base. If you've seen the double helix before, this means the sequencer can't figure out the identity of one half of a rung of the helix. Certain regions of the genome are more difficult to characterize than others.
Bases (that is, x data points) with many basecalls, on the order of 100 or more, can be identified definitively. For example, if there were a total of 250 calls for one base, with 248 T's, 1 G, and 1 A, we would call that base a T. Regions with 0 basecalls are a concern because we then have to infer the identity of the low-read region from neighboring regions.
Is there a straightforward algorithm for assigning these plots a score that reflects this tendency to drop off? See box.net/shared/nbygq2x03u for an example histogram.
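To make the 250-call example concrete, here is a minimal sketch of majority-vote base calling. The function name `call_base` and the hard `min_depth=100` cutoff are assumptions for illustration only. The question says "on the order of" 100 calls, not a strict threshold, and real base callers also weigh quality scores.

```python
from collections import Counter


def call_base(calls, min_depth=100):
    """Return the most frequent base, or None if there are too few calls.

    `calls` is any iterable of single-character base calls ("A", "C", "T", "G").
    `min_depth` is an assumed cutoff, not a value fixed by the question.
    """
    counts = Counter(calls)
    depth = sum(counts.values())
    if depth < min_depth:
        # Too few reads to identify the base with confidence.
        return None
    base, _ = counts.most_common(1)[0]
    return base


# The example from the question: 248 T's, 1 G, 1 A out of 250 calls.
calls = "T" * 248 + "G" + "A"
print(call_base(calls))  # T
```

A position with no calls at all returns `None`, which is the case where the identity would have to be inferred from neighboring regions.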
Comments (1)
You could just use the count of base positions where read depth was 0. The slope of the dropoff line could also be a useful indicator (steep negative slope = drop from plateau).
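Both suggestions can be sketched in a few lines. This is only one possible reading of the answer: the function name `dropoff_score`, the `window` parameter, and the choice to report the zero-depth count as a fraction are all assumptions, and the input is assumed to be a plain list of per-position read depths.

```python
def dropoff_score(depths, window=5):
    """Return (zero_fraction, steepest_slope) for per-position read depths.

    zero_fraction:  share of positions whose read depth is exactly 0.
    steepest_slope: most negative change in depth per position, measured
                    across `window` positions (0.0 if the profile is shorter
                    than the window or never drops).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must not be empty")
    zero_fraction = sum(1 for d in depths if d == 0) / n
    # Finite-difference slope between positions that are `window` apart.
    slopes = [(depths[i + window] - depths[i]) / window for i in range(n - window)]
    steepest_slope = min(slopes) if slopes else 0.0
    return zero_fraction, steepest_slope


# A plateau at depth 250 that falls linearly to 0 and stays there.
depths = [250] * 10 + [200, 150, 100, 50, 0] + [0] * 5
print(dropoff_score(depths))  # (0.3, -50.0)
```

A larger `window` smooths out read-to-read noise at the cost of blurring short dropoffs, so the value would need tuning against real histograms.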