Optimizing the calculation of gamete frequencies in populations

Posted on 2024-12-01 08:23:47

I need to optimize calculation of the frequencies of gametes in populations.

I have np populations with Ne individuals in each. Each individual is formed by two gametes (one male, one female). Each gamete contains three genes, and each gene may be 0 or 1, so each individual is a 2x3 matrix whose rows are the gametes given by the two parents. The set of individuals in each population may be arbitrary (but always of length Ne). For simplicity, the initial populations may be given as:

Ne = 300; np = 3^7;
(*This table may be arbitrary with the same shape*)
ind = Table[{{0, 0, 0}, {1, 1, 1}}, {np}, {Ne}]

Full set of all possible gametes:

allGam = Tuples[{0, 1}, 3]

Each individual can generate a gamete in 8 possible ways with equal probability. These gametes are Tuples@Transpose@ind[[iPop, iInd]] (where iPop and iInd are the indices of the population and of the individual within that population). I need to calculate the frequencies of the gametes generated by the individuals of each population.
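
As a small illustration of this step (the individual below is made up for the example and is not taken from the data above), here is what Tuples@Transpose produces for someone who is homozygous at the first two loci and heterozygous at the third:

```mathematica
(* Hypothetical individual: the rows are the two parental gametes *)
one = {{0, 1, 0}, {0, 1, 1}};

(* Transpose pairs up the two alleles available at each locus *)
loci = Transpose[one]    (* {{0, 0}, {1, 1}, {0, 1}} *)

(* Tuples picks one allele per locus in all 2^3 = 8 ways *)
gams = Tuples[loci];

(* Distinct gametes with their multiplicities among the 8 *)
Tally[gams]              (* {{{0, 1, 0}, 4}, {{0, 1, 1}, 4}} *)
```

Only two distinct gametes occur here, each with weight 4/8, which is why the counts are later divided by 8 Ne rather than by Ne.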

At this moment my solution is as follows.

First, I convert each individual into the gametes it can produce:

gamsInPop = Map[Sequence @@ Tuples@Transpose@# &, ind, {2}]

A more efficient way to do this is:

gamsInPop = 
 Table[Join @@ Table[Tuples@Transpose@ind[[i, j]], {j, 1, Ne}], {i, 1, np}]

Second, I calculate the frequencies of the gametes produced, including zero frequencies for gametes that are possible but absent from the population:

gamFrq = Table[Count[pop, gam]/(8 Ne), {pop, gamsInPop}, {gam, allGam}]

A more efficient version of this code:

gamFrq = Total[
   Developer`ToPackedArray[
    gamsInPop /. Table[
      allGam[[i]] -> Insert[{0, 0, 0, 0, 0, 0, 0}, 1, i], {i, 1, 
       8}]], {2}]/(8 Ne)

Unfortunately, the code is still too slow. Can anybody help me speed it up?

拔了角的鹿 2024-12-08 08:23:47

This code:

Clear[getFrequencies];
Module[{t = 
   Developer`ToPackedArray[
     Table[FromDigits[#, 2] & /@ 
         Tuples[Transpose[{
            PadLeft[IntegerDigits[i, 2], 3], 
            PadLeft[IntegerDigits[j, 2], 3]}]], 
       {i, 0, 7}, {j, 0, 7}]
    ]},
   getFrequencies[ind_] :=
    With[{extracted = 
       Partition[
          Flatten@Extract[t, Flatten[ind.(2^Range[0, 2]) + 1, 1]], 
          Ne*8]},
        Map[
         Sort@Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] &@Tally[#] &, 
         extracted
        ][[All, All, 2]]/(Ne*8)
    ]
]

utilizes conversion to decimal numbers and packed arrays, and speeds your code up by a factor of 40 on my machine. The benchmarks:

In[372]:= Ne=300;np=3^7;
(*This table may be arbitrary with the same shape*)
inds=Table[{{0,0,0},{1,1,1}},{np},{Ne}];

In[374]:= 
getFrequencies[inds]//Short//Timing
Out[374]= {0.282,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

In[375]:= 
Timing[
  gamsInPop=Table[Join@@Table[Tuples@Transpose@inds[[i,j]],{j,1,Ne}],{i,1,np}];
  gamFrq=Total[Developer`ToPackedArray[gamsInPop/.Table[allGam[[i]]->
         Insert[{0,0,0,0,0,0,0},1,i],{i,1,8}]],{2}]/(8 Ne)//Short]

Out[375]= {10.563,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
  {1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

Note that in general (for random populations) the frequencies in your solution and mine come out in a different order for some reason; with fr holding my result:

In[393]:= fr[[All,{1,5,3,7,2,6,4,8}]] == gamFrq
Out[393]= True
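
The permutation {1,5,3,7,2,6,4,8} is exactly what one gets by reversing the three binary digits of each code 0..7, which suggests the reordering is a matter of bit order: ind.(2^Range[0, 2]) weights the first gene as the least significant bit, whereas allGam = Tuples[{0, 1}, 3] lists gametes with the first gene as the most significant one. A quick check of that reading (a sketch added for illustration, not part of the original answer):

```mathematica
(* Reverse the 3-bit representation of each code 0..7, then shift to 1-based positions *)
perm = Table[FromDigits[Reverse[IntegerDigits[k, 2, 3]], 2] + 1, {k, 0, 7}]
(* {1, 5, 3, 7, 2, 6, 4, 8} *)

(* Position of each gamete of allGam under the least-significant-first coding used for indexing *)
allGam = Tuples[{0, 1}, 3];
allGam.(2^Range[0, 2]) + 1
(* {1, 5, 3, 7, 2, 6, 4, 8} *)
```

If this reading is right, using 2^Range[2, 0, -1] in place of 2^Range[0, 2] inside getFrequencies should make the two orderings agree, although that variant has not been run against the benchmark above.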

Now, some explanation. First, we create a table t, constructed as follows: each gamete is assigned a number from 0 to 7, obtained by reading its zeros and ones as binary digits. The table stores the possible gametes produced by an individual at position {i,j}, where i is the decimal number for the mother's gamete (say) and j is the one for the father's; each individual is uniquely identified by such a pair {i,j}. The gametes produced by the individual are also converted to decimal numbers. Here is how it looks:

In[396]:= t//Short[#,5]&
Out[396]//Short= {{{0,0,0,0,0,0,0,0},{0,1,0,1,0,1,0,1},{0,0,2,2,0,0,2,2},
{0,1,2,3,0,1,2,3},{0,0,0,0,4,4,4,4},{0,1,0,1,4,5,4,5},{0,0,2,2,4,4,6,6},
{0,1,2,3,4,5,6,7}},<<6>>,{{7,6,5,4,3,2,1,0},{7,7,5,5,3,3,1,1},{7,6,7,6,3,2,3,2},
<<2>>,{7,7,5,5,7,7,5,5},{7,6,7,6,7,6,7,6},{7,7,7,7,7,7,7,7}}}

A very important (crucial) step is to convert this table to a packed array.
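
One way to see what the packing step does is to build the table both ways and compare them. This is only a sketch: raw and packed are names introduced here, and whether the unpacked Table result is already packed (and how the byte counts differ) may depend on the Mathematica version, so no output is claimed for those lines.

```mathematica
(* The lookup table exactly as built in the answer, before packing *)
raw = Table[
   FromDigits[#, 2] & /@ 
    Tuples[Transpose[{PadLeft[IntegerDigits[i, 2], 3], 
       PadLeft[IntegerDigits[j, 2], 3]}]], {i, 0, 7}, {j, 0, 7}];

(* The same data stored as a packed array of machine integers *)
packed = Developer`ToPackedArray[raw];

Dimensions[packed]               (* {8, 8, 8} *)
Developer`PackedArrayQ[packed]   (* True: a rectangular array of machine integers can be packed *)

(* Inspect these on your own system; results are not asserted here *)
Developer`PackedArrayQ[raw]
{ByteCount[raw], ByteCount[packed]}
```

The same check can be applied to the input: if Developer`PackedArrayQ[inds] gives False, wrapping inds in Developer`ToPackedArray before calling getFrequencies may help, on the assumption that the dot product and Extract then stay on packed data.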

The expression Flatten[ind.(2^Range[0, 2]) + 1, 1] converts the parents' gametes from binary to decimal for all individuals in all populations at once, and adds 1 so that the results become the indices in the table t at which the list of gametes a given individual can produce is stored. We then Extract all of them at once, for all populations, and use Flatten and Partition to recover the population structure. Next, we count occurrences with Tally, append the missing gametes with count zero (this is the Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] part), and Sort the list for each population. Finally, we keep the counts, discard the gamete decimal indices, and divide by Ne*8 to obtain frequencies.
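
The steps above can be traced by hand on a single individual. The sketch below rebuilds t as in the answer and uses a made-up individual (one); the intermediate names idx, codes, tally, missing and full are introduced only for this walk-through.

```mathematica
(* Lookup table from the answer: t[[i + 1, j + 1]] lists the 8 gamete codes
   produced by parental gametes with codes i and j *)
t = Developer`ToPackedArray[
   Table[FromDigits[#, 2] & /@ 
     Tuples[Transpose[{PadLeft[IntegerDigits[i, 2], 3], 
        PadLeft[IntegerDigits[j, 2], 3]}]], {i, 0, 7}, {j, 0, 7}]];

one = {{0, 0, 0}, {1, 0, 1}};     (* hypothetical individual *)

(* Binary to decimal for both parental gametes, plus 1 to get table indices *)
idx = one.(2^Range[0, 2]) + 1     (* {1, 6} *)

(* The 8 gamete codes this individual can produce *)
codes = Extract[t, idx]           (* {0, 1, 0, 1, 4, 5, 4, 5} *)

(* Count the codes that occur, then add the absent ones with count 0 *)
tally = Tally[codes]              (* {{0, 2}, {1, 2}, {4, 2}, {5, 2}} *)
missing = Complement[Range[0, 7], tally[[All, 1]]]   (* {2, 3, 6, 7} *)
full = Sort@Join[tally, Thread[{missing, 0}]];

(* Keep the counts only and normalise by the 8 gametes of this one individual *)
full[[All, 2]]/8                  (* {1/4, 1/4, 0, 0, 1/4, 1/4, 0, 0} *)
```

In getFrequencies the same sequence runs over Ne*8 codes per population instead of 8, which is where the division by Ne*8 comes from.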

All operations are pretty fast because they are performed on packed arrays. The speedup is due to the vectorized formulation of the problem and the use of packed arrays. It is also much more memory-efficient.
