DNA序列的混沌游戏

发布于 2024-12-14 02:44:42 字数 867 浏览 4 评论 0原文

我已经尝试过使用mathematica代码来制作这个地址中发布的DNA序列的混沌游戏： http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

就像这样：

genome = Import["c:\data\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]

我拥有的 fasta 序列只是一个字母序列，例如 AACCTTTGATCAAA 要生成的图形如下所示：

在此处输入图像描述

代码对于小序列可以正常工作，但是当我想要时如果要放入一个巨大的序列，例如近40Mb的染色体，程序会花费大量时间，并且只显示一个黑色方块，因此无法分析。是否可以改进上述代码，以便显示它的正方形会更大？顺便说一句，正方形必须只是平方单位。感谢您提前的帮助

原文

I have tried the mathematica code for making the chaos game for DNA sequences posted in this address:
http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html

which is like this:

genome = Import["c:\data\sequence.fasta", "Sequence"];
genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
chars = StringCases[genome, "G" | "C" | "T" | "A"];
f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
pts = FoldList[f, {0.5, 0.5}, chars];
Graphics[{PointSize[Tiny], Point[pts]}]

the fasta sequence that I have is just a sequence of letters like AACCTTTGATCAAA
and the graph to be generated comes like this:

enter image description here

the code works fine with small sequences, but when I want to put a huge sequence, for example almost 40Mb of a chromosome, the program takes a lot of time and only displays a black square so that it is impossible to analyze.
Is it possible to improve the aforementioned code, so that the square in which it would be displayed it would be bigger?, by the way the square must be only the square unit.
Thanks for your help in advance

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

萌辣 2024-12-21 02:44:42

下面增量编辑的摘要：

这将使您使用编译代码计算点坐标时获得相当大的加速（不包括计算 shifts 50 倍）：

shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];

代码中的瓶颈是在实际渲染图形时，我们不是绘制每个点，而是可视化点的密度：

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

如果一个区域至少有阈值点，则该区域将被涂成黑色。 size 是图像尺寸。通过选择大尺寸或大阈值，您可以避免“黑方块问题”。

我原来的答案有更多细节：

在我相当过时的机器上，代码不是很慢。

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

我得到的时间为 6.8 秒，这是可用的，除非您需要在循环中运行它很多次（如果对于您的用例和机器来说它不够快，请添加评论，我们会尽力加快速度））。

不幸的是，渲染图形所花费的时间比这个长得多（36 秒），我不知道你是否可以对此做些什么。禁用抗锯齿可能有一点帮助，具体取决于您的平台，但效果不大：Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False]（对我来说不是）。对于我们许多人来说，这是一个长期存在的烦恼。

由于整个图形是黑色的，您可以使用鼠标调整其大小并使其变大。下次评估表达式时，输出图形将记住其大小。或者只使用 ImageSize -> 800 作为图形选项。考虑到屏幕的像素密度，我能想到的唯一其他解决方案（不涉及调整图形大小）是使用灰色阴影表示像素密度，并绘制密度。

编辑：

这就是绘制密度的方法（这也比点图计算和渲染快得多！）：

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

调整分辨率以使绘图更漂亮。

对于我的随机序列示例，这仅给出灰色图。对于您的基因组数据，它可能会给出更有趣的模式。

编辑2：

这是使用编译加速函数的简单方法：

首先，用移位向量替换字符（对于数据集只需执行一次，然后就可以保存结果）：

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};

然后让我们编译我们的函数：

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

如果您的 Mathematica 版本早于 8 或者您没有安装 C 编译器，请删除 CompilationTarget。

fun[arr]; // Timing

给我 0.6 秒，即瞬间加速 10 倍。

编辑3：

与上面的编译版本相比，通过避免编译函数中的一些内核回调，可以实现约 5 倍的加速（我使用 CompilePrint 检查了编译输出，得出这个版本 --- 否则就不明显为什么它更快）：

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

在我的机器上运行只需 0.11 秒。在更现代的机器上，即使对于 40 MB 的数据集，它也应该在几秒钟内完成。

我将转置分成单独的输入，因为此时 fun1d 的运行时间开始与 Transpose 的运行时间相当。

Summary of the incremental edits below:

This will give you a considerable speedup in computing the point coordinates by using compiled code (50x excluding computing shifts):

shifts = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};
fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], CompilationTarget -> "C"]
pts = Transpose[fun1d /@ Transpose[shifts]];

The bottleneck in your code is actually rendering the graphic, we instead of plotting each point, we'll visualize the density of points:

threshold = 1;
With[{size = 300}, 
 Image[1 - UnitStep[BinCounts[pts, 1/size, 1/size] - threshold]]
]

A region will be coloured black if it has at least threshold points. size is the image-dimension. By either choosing a large size or a large threshold you can avoid the "black square problem".

My original answer with more details:

On my rather dated machine, the code is not very slow.

chars = RandomChoice[{"A", "T", "C", "G"}, 800000];

f[x_, "A"] := x/2;
f[x_, "T"] := x/2 + {1/2, 0};
f[x_, "G"] := x/2 + {1/2, 1/2};
f[x_, "C"] := x/2 + {0, 1/2};
Timing[pts = FoldList[f, {0.5, 0.5}, chars];]
Graphics[{PointSize[Tiny], Point[pts]}]

I get a timing of 6.8 seconds, which is usable unless you need to run it lots of times in a loop (if it's not fast enough for your use case and machine, please add a comment, and we'll try to speed it up).

Rendering the graphic unfortunately takes much longer than this (36 seconds), and I don't know if there's anything you can do about it. Disabling antialiasing may help a little bit, depending on your platform, but not much: Style[Graphics[{PointSize[Tiny], Point[pts]}], Antialiasing -> False] (for me it doesn't). This is a long-standing annoyance for many of us.

Regarding the whole graphic being black, you can resize it using your mouse and make it bigger. The next time you evaluate your expression, the output graphic will remember its size. Or just use ImageSize -> 800 as a Graphics option. Considering the pixel density of screens the only other solution that I can think of (that doesn't involve resizing the graphic) would be to represent pixel density using shades of grey, and plot the density.

EDIT:

This is how you can plot the density (this is also much much faster to compute and render than the point-plot!):

With[{resolution = 0.01}, 
 ArrayPlot@BinCounts[pts, resolution, resolution]
]

Play with the resolution to make the plot nice.

For my random-sequence example, this only gives a grey plot. For your genome data it will probably give a more interesting pattern.

EDIT 2:

Here's a simple way to speed up the function using compilation:

First, replace the characters by the shift vectors (has to be done only once for a dataset, then you can save the result):

arr = chars /. {"A" -> {0., 0.}, "T" -> {.5, 0.}, "G" -> {.5, .5}, "C" -> {0, .5}};

Then let's compile our function:

fun = Compile[{{a, _Real, 2}}, FoldList[#/2 + #2 &, {.5, .5}, a], 
 CompilationTarget -> "C"]

Remove CompilationTarget if your version of Mathematica is earlier than 8 or you don't have a C compiler installed.

fun[arr]; // Timing

gives me 0.6 seconds, which is an instant 10x speedup.

EDIT 3:

Another ~5x speedup is possible compared to the above compiled version by avoiding some kernel callbacks in the compiled function (I checked the compilation output using CompilePrint to come up with this version --- otherwise it's not obvious why it's faster):

fun1d = Compile[{{a, _Real, 1}}, FoldList[#/2 + #2 &, .5, a], 
  CompilationTarget -> "C"]

arrt = Transpose[arr];
Timing[result = fun1d /@ arrt;]
pts = Transpose[result];

This runs in 0.11 seconds on my machine. On a more modern machine it should finish in a few seconds even for a 40 MB dataset.

I split off the transpositions into separate inputs because at this point the running time of fun1d starts to get comparable to the running time of Transpose.

回复收藏 0 原文

~没有更多了~