Chaos game for DNA sequences
I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address:
http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html
It looks like this:
genome = Import["c:\\data\\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]
The FASTA sequence that I have is just a sequence of letters like AACCTTTGATCAAA,
and the generated graph looks like this:
The code works fine with small sequences, but when I use a huge sequence, for example almost 40 Mb of a chromosome, the program takes a long time and displays only a black square, which is impossible to analyze.
Is it possible to improve the code above so that the square in which the plot is displayed is bigger? By the way, the square must be exactly the unit square.
Thanks in advance for your help.
Summary of the incremental edits below:
This will give you a considerable speedup in computing the point coordinates by using compiled code (50x, excluding the time to compute shifts).
The bottleneck in your code is actually rendering the graphic, so instead of plotting each point, we'll visualize the density of points:
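A minimal sketch of such a thresholded density image, not necessarily the exact code this answer used. The stand-in pts (a chaos game driven by a random sequence) and the values chosen for threshold and size are assumptions for illustration; with real data, pts would come from the chaos game computation.

```mathematica
(* Stand-in data (assumption): chaos game points for a random sequence,
   using the same four corner offsets as f in the question *)
corners = {{0., 0.}, {0.5, 0.}, {0.5, 0.5}, {0., 0.5}};
pts = FoldList[#1/2 + #2 &, {0.5, 0.5}, RandomChoice[corners, 10^5]];

threshold = 1;  (* a cell is drawn black if it holds at least this many points *)
size = 300;     (* the image is size x size cells covering the unit square *)

(* Count the points falling in each cell of a size x size grid *)
counts = BinCounts[pts, {0, 1, 1/size}, {0, 1, 1/size}];

(* UnitStep gives 1 where counts >= threshold, so 1 - UnitStep is 0 (black) there.
   Transpose and Reverse put x on the horizontal axis and y pointing upwards. *)
Image[N[Reverse[Transpose[1 - UnitStep[counts - threshold]]]]]
```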
A region will be coloured black if it has at least threshold points; size is the image dimension. By choosing either a large size or a large threshold you can avoid the "black square problem".
My original answer with more details:
On my rather dated machine, the code is not very slow.
I get a timing of 6.8 seconds, which is usable unless you need to run it lots of times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).
Rendering the graphic unfortunately takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little bit, depending on your platform, but not much:
Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False]
(for me it doesn't). This is a long-standing annoyance for many of us.
Regarding the whole graphic being black, you can resize it using your mouse and make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option.
Considering the pixel density of screens, the only other solution that I can think of (that doesn't involve resizing the graphic) would be to represent pixel density using shades of grey, and plot the density.
EDIT:
This is how you can plot the density (this is also much much faster to compute and render than the point-plot!):
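One way this could look, as a sketch under assumptions: the stand-in pts, the name resolution and its value are illustrative choices, and the default grey levels of ArrayPlot (white for the smallest count, black for the largest) are relied on for the shading.

```mathematica
(* Stand-in data (assumption): chaos game points for a random sequence *)
corners = {{0., 0.}, {0.5, 0.}, {0.5, 0.5}, {0., 0.5}};
pts = FoldList[#1/2 + #2 &, {0.5, 0.5}, RandomChoice[corners, 10^5]];

resolution = 1/200;  (* side length of one density cell; smaller means finer detail *)

(* Number of points in each cell of the unit square *)
counts = BinCounts[pts, {0, 1, resolution}, {0, 1, resolution}];

(* Darker cells hold more points; Transpose and Reverse orient the plot
   so that x runs left to right and y runs bottom to top *)
ArrayPlot[Reverse[Transpose[counts]], Frame -> False]
```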
Play with the resolution to make the plot nice.
For my random-sequence example, this only gives a grey plot. For your genome data it will probably give a more interesting pattern.
EDIT 2:
Here's a simple way to speed up the function using compilation:
First, replace the characters by the shift vectors (has to be done only once for a dataset, then you can save the result):
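A sketch of this replacement step, assuming chars is the list of single-letter strings from the question. The offsets are the ones added by f in the question; the name rules is hypothetical.

```mathematica
(* Example input: the short sequence quoted in the question *)
chars = Characters["AACCTTTGATCAAA"];

(* Each base maps to the offset that f adds after halving *)
rules = {"A" -> {0., 0.}, "T" -> {0.5, 0.}, "G" -> {0.5, 0.5}, "C" -> {0., 0.5}};

(* Replace at level 1 only, giving an n x 2 matrix of reals *)
shifts = Replace[chars, rules, {1}];

Dimensions[shifts]
```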
Then let's compile our function:
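A possible form of the compiled function, as a sketch only: the name fun and the random stand-in for shifts are assumptions. The fold itself is the same recurrence as f in the question, with the offset already looked up.

```mathematica
(* Stand-in data (assumption): offsets for a random sequence *)
corners = {{0., 0.}, {0.5, 0.}, {0.5, 0.5}, {0., 0.5}};
shifts = RandomChoice[corners, 10^5];

(* Halve the current point and add the next offset, starting from the centre.
   CompilationTarget -> "C" needs a C compiler; drop the option otherwise. *)
fun = Compile[{{a, _Real, 2}},
   FoldList[#1/2 + #2 &, {0.5, 0.5}, a],
   CompilationTarget -> "C"];

(* Time the computation of all points *)
AbsoluteTiming[pts = fun[shifts];]
```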
Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.
Running the compiled function takes 0.6 seconds for me, which is an instant 10x speedup.
EDIT 3:
Another ~5x speedup is possible compared to the above compiled version by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version; otherwise it's not obvious why it's faster):
This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.
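A sketch of a one-dimensional version in this spirit. Only the name fun1d comes from the text; its body, the stand-in shifts and the names xy and res are assumptions. The idea is to fold over plain real numbers, one coordinate at a time, instead of over two-element vectors.

```mathematica
(* Stand-in data (assumption): offsets for a random sequence *)
corners = {{0., 0.}, {0.5, 0.}, {0.5, 0.5}, {0., 0.5}};
shifts = RandomChoice[corners, 10^5];

(* Scalar recurrence x -> x/2 + shift, applied to a single coordinate *)
fun1d = Compile[{{a, _Real, 1}},
   FoldList[#1/2 + #2 &, 0.5, a],
   CompilationTarget -> "C"];

(* Optional: inspect the compiled instructions for calls back to the kernel *)
Needs["CompiledFunctionTools`"];
CompilePrint[fun1d]

xy = Transpose[shifts];   (* 2 x n: all x offsets, then all y offsets *)
res = fun1d /@ xy;        (* run the recurrence on each coordinate *)
pts = Transpose[res];     (* back to a list of {x, y} points *)
```

Keeping the two Transpose calls as separate steps makes it easy to time them apart from fun1d.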
I split off the transpositions into separate inputs because at this point the running time of fun1d starts to get comparable to the running time of Transpose.