Is it possible to shuffle a 2D matrix while preserving both row and column frequencies?

Posted on 2024-10-18 11:16:27

Suppose I have a 2D array like the following:

GACTG
AGATA
TCCGA

Each array element is taken from a small finite set (in my case, DNA nucleotides -- {A, C, G, T}). I would like to randomly shuffle this array somehow while preserving both row and column nucleotide frequencies. Is this possible? Can it be done efficiently?

[EDIT]: By this I mean I want to produce a new matrix where each row has the same number of As, Cs, Gs and Ts as the corresponding row of the original matrix, and where each column has the same number of As, Cs, Gs and Ts as the corresponding column of the original matrix. Permuting the rows or columns of the original matrix will not achieve this in general. (E.g. for the example above, the top row has 2 Gs, and 1 each of A, C and T; if this row was swapped with row 2, the top row in the resulting matrix would have 3 As, 1 G and 1 T.)
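To pin down the requirement, here is a minimal Python sketch of a validity check. The names `freq_signature` and `same_frequencies` are made up for illustration, and matrices are assumed to be lists of equal-length strings.

```python
from collections import Counter

def freq_signature(matrix):
    """Return (row_counts, col_counts): one Counter per row and per column."""
    rows = [Counter(row) for row in matrix]
    cols = [Counter(col) for col in zip(*matrix)]
    return rows, cols

def same_frequencies(a, b):
    """True if b has the same per-row and per-column symbol counts as a."""
    return freq_signature(a) == freq_signature(b)

original = ["GACTG", "AGATA", "TCCGA"]
# Swapping rows 1 and 2 keeps every column's counts but changes the row counts.
row_swapped = ["AGATA", "GACTG", "TCCGA"]

print(same_frequencies(original, original))     # True
print(same_frequencies(original, row_swapped))  # False
```

Any candidate shuffle can be run through `same_frequencies` against the original to confirm it is a valid result.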

It's simple enough to preserve just column frequencies by shuffling a column at a time, and likewise for rows. But doing this will in general alter the frequencies of the other kind.

My thoughts so far: If it's possible to pick 2 rows and 2 columns so that the 4 elements at the corners of this rectangle have the pattern

XY
YX

for some pair of distinct elements X and Y, then replacing these 4 elements with

YX
XY

will maintain both row and column frequencies. In the example at the top, this can be done for (at least) rows 1 and 2 and columns 2 and 5 (whose corners give the 2x2 matrix AG;GA), and for rows 1 and 3 and columns 1 and 4 (whose corners give GT;TG). Clearly this could be repeated a number of times to produce some level of randomisation.
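A rough Python sketch of the 2x2 swap idea, assuming the matrix is a list of equal-length strings and using 0-based indices (so "rows 1 and 2, columns 2 and 5" becomes `(0, 1, 1, 4)`). The helper names `find_swaps` and `apply_swap` are invented for this example.

```python
from itertools import combinations

def find_swaps(m):
    """List (r1, r2, c1, c2) whose corners form the XY / YX pattern with X != Y."""
    found = []
    for r1, r2 in combinations(range(len(m)), 2):
        for c1, c2 in combinations(range(len(m[0])), 2):
            x, y = m[r1][c1], m[r1][c2]
            if x != y and m[r2][c1] == y and m[r2][c2] == x:
                found.append((r1, r2, c1, c2))
    return found

def apply_swap(m, r1, r2, c1, c2):
    """Return a copy of m with the four corners flipped from XY/YX to YX/XY."""
    out = [list(row) for row in m]
    out[r1][c1], out[r1][c2] = out[r1][c2], out[r1][c1]
    out[r2][c1], out[r2][c2] = out[r2][c2], out[r2][c1]
    return ["".join(row) for row in out]

m = ["GACTG", "AGATA", "TCCGA"]
print(find_swaps(m))
print(apply_swap(m, 0, 1, 1, 4))  # ['GGCTA', 'AAATG', 'TCCGA']
```

Scanning every pair of rows and columns costs O(rows² · cols²), which is fine for small matrices. A randomised version would sample candidate rectangles instead of listing them all.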

Generalising a bit, any "subrectangle" induced by a subset of rows and a subset of columns, in which the frequencies of all rows are the same and the frequencies of all columns are the same, can have both its rows and columns permuted to produce a valid complete rectangle. (Of these, only those subrectangles in which at least 1 element is changed are actually interesting.) Big questions:

  1. Are all valid complete matrices reachable by a series of such "subrectangle rearrangements"? I suspect the answer is yes.
  2. Are all valid subrectangle rearrangements decomposable into a series of 2x2 swaps? [EDIT]: mhum's counterexample shows that the answer is no. Unfortunate, because this would seem to make it harder to come up with an efficient algorithm, but important to know.
  3. Can some or all of the valid rearrangements be computed efficiently?

This question addresses a special case in which the set of possible elements is {0, 1}. The solutions people have come up with there are similar to what I have come up with myself, and are probably usable, but not ideal as they require an arbitrary amount of backtracking to work correctly. Also I'm concerned that only 2x2 swaps are considered.

Finally, I would ideally like a solution that can be proven to select a matrix uniformly at random from the set of all matrices having identical row frequencies and column frequencies to the original. I know, I'm asking for a lot :)

Comments (4)

静谧 2024-10-25 11:16:28

The answer to question 2 is no. Consider the following 2 matrices:

A B C   C A B
C A B   B C A
B C A   A B C

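They clearly have the same row and column frequencies. Yet neither contains a 2x2 submatrix whose corners form the XY/YX pattern, so no 2x2 swap can be applied and one cannot be turned into the other by such swaps.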
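The counterexample is easy to check mechanically. This is only a sketch with made-up helper names (`signature`, `has_2x2_swap`), using lists of strings for the matrices.

```python
from collections import Counter
from itertools import combinations

def signature(m):
    """Per-row and per-column symbol counts."""
    return [Counter(r) for r in m], [Counter(c) for c in zip(*m)]

def has_2x2_swap(m):
    """True if some pair of rows and columns has corners in the XY / YX pattern."""
    for r1, r2 in combinations(range(len(m)), 2):
        for c1, c2 in combinations(range(len(m[0])), 2):
            x, y = m[r1][c1], m[r1][c2]
            if x != y and m[r2][c1] == y and m[r2][c2] == x:
                return True
    return False

a = ["ABC", "CAB", "BCA"]
b = ["CAB", "BCA", "ABC"]

print(signature(a) == signature(b))      # True
print(has_2x2_swap(a), has_2x2_swap(b))  # False False
```

Since neither matrix admits a 2x2 swap, a procedure restricted to those swaps is stuck at whichever of the two it starts from.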

葬﹪忆之殇 2024-10-25 11:16:28

No clue, but what you are talking about is basically a generalized sudoku solver. Try http://scholar.google.com/scholar?q=sudoku

渔村楼浪 2024-10-25 11:16:28

It turns out that for 0-1 matrices, 2x2 swaps are sufficient to get from one matrix to any other. This was proved by H J Ryser as Theorem 3.1 in a paper called "Combinatorial Properties of Matrices of Zeros and Ones": http://cms.math.ca/cjm/v9/cjm1957v09.0371-0377.pdf . People have been trying to prove for a while that the Markov chain based on 2x2 swaps mixes rapidly; this paper http://arxiv.org/pdf/1004.2612v3 seems to come the closest.

If one could prove the generalization of Ryser's theorem to your case (maybe with up to 4x4 "swaps"), then on account of the symmetry of the swaps, it wouldn't be too hard to get a chain whose steady state distribution is uniform on the matrices of interest. I don't think there's any hope at the moment of proving that it mixes rapidly for all possible row/column distributions, but perhaps you know something about the distributions that we don't...
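To make the symmetry argument concrete, here is a hedged Python sketch of a chain built from 2x2 swaps only. The function name `swap_chain` and the step counts are assumptions for illustration. Each step proposes a uniformly random pair of rows and pair of columns, applies the swap if the corners match the XY/YX pattern, and otherwise stays put. The same proposal undoes a move, so the transition probabilities are symmetric and the chain is uniform over whatever set of matrices it can reach. For a general alphabet that set need not be all valid matrices (see the 3x3 counterexample in the other answer), and nothing here says how many steps are enough.

```python
import random
from collections import Counter

def signature(m):
    """Per-row and per-column symbol counts."""
    return [Counter(r) for r in m], [Counter(c) for c in zip(*m)]

def swap_chain(matrix, steps, rng):
    """Run `steps` iterations of the lazy 2x2-swap chain; needs >= 2 rows and columns."""
    m = [list(row) for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(steps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        x, y = m[r1][c1], m[r1][c2]
        if x != y and m[r2][c1] == y and m[r2][c2] == x:
            # Corners are XY / YX: flip them to YX / XY.
            m[r1][c1], m[r1][c2] = y, x
            m[r2][c1], m[r2][c2] = x, y
        # Otherwise the proposal is rejected and the state is unchanged.
    return ["".join(row) for row in m]

original = ["GACTG", "AGATA", "TCCGA"]
shuffled = swap_chain(original, 1000, random.Random(0))
print(shuffled)
print(signature(shuffled) == signature(original))  # True
```

Rejected proposals still count as steps, and that is what keeps the proposal distribution identical in every state.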

橘香 2024-10-25 11:16:27

Edit: oops missed the last paragraph of OP's question, let me rephrase.

To digress briefly, the question you linked to had quite a hilarious discussion about the "level" of randomness of the selected solution. Allow me to quote:

"...I really require matrices that are as random as possible..."

"...The algorithm, as implemented in the code, is quite random..."

"...if you choose this method, a different way to improve the randomness is to repeat the randomization process several times (a random number of times)..."

None of these comments makes any sort of sense: there is no such thing as "more" random. This is all exactly like this lovely Daily WTF entry. That said, the last quote is almost onto something. It's well known that if you simulate a Markov chain, like that random swapping algorithm, for long enough, you will eventually start generating samples from the steady-state distribution. Just exactly what that distribution looks like, who knows...

Anyway, depending on your objectives you may not really care what this distribution looks like as long as it contains enough elements. So some sort of swapping algorithm might be useful, but I really would not expect this to be easy since the problem is NP-Complete (more general than Sudoku).

With that in mind, you could consider solving your problem with any approach that works for solving Sudoku. If you are in academia, I would suggest getting a copy of IBM CPLEX 12, which is free for academic use. You can code up a Sudoku-like solver in their CP language (OPL) and ask the integer linear program solver to generate solutions for you. I think they even have example code for solving Sudoku that you can borrow from.

Here's the only truly random and unbiased way I can think of to sample from such matrices: first get CPLEX to find all N solutions to the given Sudoku-like problem. Once you have this set of N solutions, draw a random number between 1 and N and use that solution; if you want another one, draw another number. Since generating all solutions might be a bit slow, you could approximate this by telling the solver to stop after a certain number of solutions or a certain elapsed time, and sample only from that set. Note that the truncated set is generally not a uniform sample of all solutions.
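For tiny inputs the "enumerate everything, then pick one" idea can be tried without a solver. The sketch below swaps CPLEX for a plain Python backtracking search; the name `enumerate_matrices` is made up. The number of solutions can grow very quickly, so this is only meant to illustrate the idea.

```python
import random
from collections import Counter

def enumerate_matrices(matrix):
    """Return every matrix with the same per-row and per-column counts as `matrix`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_left = [Counter(r) for r in matrix]        # symbols still owed to each row
    col_left = [Counter(c) for c in zip(*matrix)]  # symbols still owed to each column
    alphabet = sorted(set("".join(matrix)))
    grid = [[None] * n_cols for _ in range(n_rows)]
    solutions = []

    def fill(pos):
        if pos == n_rows * n_cols:
            solutions.append(["".join(r) for r in grid])
            return
        r, c = divmod(pos, n_cols)
        for sym in alphabet:
            if row_left[r][sym] > 0 and col_left[c][sym] > 0:
                row_left[r][sym] -= 1
                col_left[c][sym] -= 1
                grid[r][c] = sym
                fill(pos + 1)
                row_left[r][sym] += 1   # undo before trying the next symbol
                col_left[c][sym] += 1

    fill(0)
    return solutions

latin = ["ABC", "CAB", "BCA"]
all_latin = enumerate_matrices(latin)
print(len(all_latin))  # 12 (the Latin squares of order 3)

rng = random.Random(0)
print(rng.choice(enumerate_matrices(["GACTG", "AGATA", "TCCGA"])))
```

Because every valid matrix appears exactly once in the list, `rng.choice` over it is uniform by construction. The cost is entirely in the enumeration.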
