生成随机列联表的有效方法?

发布于 2024-07-22 21:42:54 字数 133 浏览 8 评论 0原文

生成随机列联表的有效方法是什么? 列联表被定义为一个矩形矩阵,每行的总和是固定的,每列的总和是固定的,但各个元素可以是任何元素,只要每行和每列的总和正确即可。

请注意,生成随机列联表非常容易,但我正在寻找比朴素算法更有效的方法。

What is an efficient way to generate a random contingency table? A contingency table is defined as a rectangular matrix such that the sum of each row is fixed, and the sum of each column is fixed, but the individual elements may be anything as long as the sum of each row and column is correct.

Note that it's very easy to generate random contingency tables, but I'm looking for something more efficient than the naive algorithm.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

醉城メ夜风 2024-07-29 21:42:56

此问题的解决方案是算法 AS159,并在 R 中实现。 这是纸

Patefield, WM (1981) 算法 AS159。 一种生成具有给定行和列总计的 rxc 表的有效方法。 应用统计学 30, 91–97。

您可以按照此链接了解该算法的实现。 如果您习惯使用 R,则只需使用其 r2dtable 函数即可。

该算法可用于生成 R 的 chisq.test 函数中生成的卡方检验的蒙特卡罗 p 值。 请注意,chisq.test 不会调用 r2dtable,而是调用 AS159 算法的直接 C 版本。

A solution to this problem, and implemented in R, is Algorithm AS159. Here is the paper

Patefield, W. M. (1981) Algorithm AS159. An efficient method of generating r x c tables with given row and column totals. Applied Statistics 30, 91–97.

You can follow this link for implementations of the algorithm. If you're used to working with R, you can simply use its r2dtable function for this.

This algorithm can be used to generate Monte Carlo p-values for Chi-squared tests as produced in R's chisq.test function. Note that chisq.test does not call r2dtable, but a direct C version of the AS159 algorithm.

梦太阳 2024-07-29 21:42:56

我不确定你的天真的算法是什么。 这是我想到的第一件事:

m\cdot n 变量m+n 线性约束导致解空间具有 (m-1)\cdot (n-1)-1自由度。

假设我们一开始就选择那么多随机数。

  a11     a12   ... a1[n-1] b
  a21     a22   ... a2[n-1] x2-x1+b
  ...     ...   ...   ...   ...
a[m-1]1 a[m-1]2 ...    d    f
   c    y2-y1+c ...    g    e

定义常量 x_i=\sum_{j=1}^{n-1}a_{i,j}y_j=\sum_{i=1}^{m-1} a_{i,j}

这导致对变量 bcdef< /em>、g

x_1+b=\sum_{j=1}^{n-2}a_{m-1,j}+d+f=(m-2)\cdot (c-y_1)+\sum_{\i=1}^{m-2}y_i+g+e
 y_1+c=\sum_{i=1}^{m-2}a_{j,n-1}+d+g=(n-2)\cdot(c-x_1)+\sum_{j=1} ^{n-2}x_j+f+e

这是一个 6 个变量的线性系统。 它可能有一个独特的解决方案......我明天会研究这个。 (目前缺乏 Maple 或其他计算机代数系统。)

I'm not sure what your naïve algorithm is. Here's the first thing I think of:

m\cdot n variables with m+n linear constraints leads to the expectation that the solution space has (m-1)\cdot(n-1)-1 degrees of freedom.

Suppose we just pick that many random numbers to start with.

  a11     a12   ... a1[n-1] b
  a21     a22   ... a2[n-1] x2-x1+b
  ...     ...   ...   ...   ...
a[m-1]1 a[m-1]2 ...    d    f
   c    y2-y1+c ...    g    e

Define the constants x_i=\sum_{j=1}^{n-1}a_{i,j} and y_j=\sum_{i=1}^{m-1}a_{i,j}.

This leads to the following constraints on the variables b, c, d, e, f, g:

x_1+b=\sum_{j=1}^{n-2}a_{m-1,j}+d+f=(m-2)\cdot(c-y_1)+\sum_{\i=1}^{m-2}y_i+g+e
y_1+c=\sum_{i=1}^{m-2}a_{j,n-1}+d+g=(n-2)\cdot(c-x_1)+\sum_{j=1}^{n-2}x_j+f+e

This is a linear system of 6 variables. It probably has a unique solution… I'll work on that tomorrow. (Lack of Maple or other computer algebra systems at the moment.)

旧人 2024-07-29 21:42:56

对我来说,这听起来像是一个约束满足问题 (CSP)。

您基本上会从某个时刻开始,然后从允许值集中随机选择单元格的值。 然后,您更新同一行/列中所有单元格的合格值集,并再次从其合格值集中选择下一个单元格(根据您使用的 CSP 启发式)来(随机)分配值。 同样,您还必须更新同一行/列中所有单元格的合格值集。 如果您遇到具有一组空的合格值的单元格,则必须进行回溯。

但是,“符合条件的值集”的概念可能很难在数据结构中表示,具体取决于您允许的值的范围。

This sounds like a constraint satisfaction problem (CSP) to me.

You would basically start at some point and choose a cell's value randomly from the set of allowed values. Then you update the sets of eligible values for all cells in the same row/column and choose the next cell (according to the CSP heuristic you are using) to (randomly) assign a value to, again from its set of eligible values. Again, you also have to update the sets of eligible values for all cells in the same row/column. In case you encounter a cell that has an empty set of eligible values, you have to do backtracking.

However, the notion of 'set of eligible values' might be hard to represent in a data structure, depending on the range of values you are allowing.

蹲墙角沉默 2024-07-29 21:42:54

查看 R 的 networksis 包的代码可能会有所帮助。 我相信高效的计算需要花哨的马尔可夫链顺序重要性重采样 技术,因此如果可以避免的话,您可能希望避免重新实现它。

编辑:相关论文是 Chen、Diaconis、Holmes 和 Liu (2005) 。 用作者的话来说,“我们的方法与其他现有的蒙特卡罗方法相比具有优势——
基于算法,​​有时效率要高几个数量级。”

Looking at the code of the networksis package for R might be helpful. I believe that efficient computation requires fancy Markov Chain sequential importance resampling techniques, so you might want to avoid reimplementing this if you can avoid it.

Edit: The relevant paper is Chen, Diaconis, Holmes, and Liu (2005). In the words of the authors, "[o]ur method compares favorably with other existing Monte Carlo-
based algorithms, and sometimes is a few orders of magnitude more efficient."

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文