Sampling without replacement algorithm?
I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.
I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?
For example (these data are just a simplified example):
Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5
For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.
To do this, I could generate a random number in the range 1-12 to assign the first element of group A, then a random number in the range 1-11 to assign the second element of group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but sampling without replacement seems like a problem that has probably been solved many times before.
Logic or pseudocode preferred.
6 Answers
Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumerical Algorithms.
There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
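The code originally attached to this answer is not reproduced above. Below is a minimal C++ sketch of Knuth's Algorithm 3.4.2S (selection sampling) as the answer describes it; the function name and use of std::mt19937 are choices for illustration:

```cpp
#include <random>
#include <vector>

// Knuth's Algorithm 3.4.2S: select sampleSize items from 0..populationSize-1.
// Scan the population once; item t is taken with probability
// (items still needed) / (items still available), which yields a uniform
// sample without replacement, returned in increasing order.
std::vector<int> sample_without_replacement(int populationSize, int sampleSize,
                                            std::mt19937& gen) {
    std::vector<int> result;
    result.reserve(sampleSize);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int selected = 0;
    for (int t = 0; t < populationSize && selected < sampleSize; ++t) {
        // Take item t with probability (sampleSize - selected) / (populationSize - t).
        if ((populationSize - t) * u(gen) < sampleSize - selected) {
            result.push_back(t);
            ++selected;
        }
    }
    return result;
}
```

Note the algorithm is O(n) in the population size, since every item is visited once; the Vitter paper cited above avoids that scan.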
A C++ working code based on the answer by John D. Cook.
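The C++ code from this answer is not reproduced above. As a stand-in sketch (not the original answer's code), C++17's std::sample performs the same task directly:

```cpp
#include <algorithm>
#include <iterator>
#include <random>
#include <vector>

// C++17 std::sample draws k elements without replacement, preserving
// their relative order in the population.
std::vector<double> draw_without_replacement(const std::vector<double>& population,
                                             std::size_t k, std::mt19937& gen) {
    std::vector<double> out;
    out.reserve(k);
    std::sample(population.begin(), population.end(),
                std::back_inserter(out), k, gen);
    return out;
}
```

For example, drawing 4 of the 12 values from the question's data gives 4 of those values, each position used at most once.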
Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulty understanding how it works, so I commented it extensively and included an example; maybe that helps to convey the idea. I also changed the variable names slightly.
When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.
When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:
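The original R code is not reproduced above. The following C++ sketch is not the author's exact binary tree algorithm, but it illustrates the same complexity class using a balanced tree (std::set) for rejection sampling, which is efficient precisely when n is much larger than s:

```cpp
#include <random>
#include <set>

// Draw s distinct values from 0..n-1 by rejection: propose uniformly and
// retry on duplicates. std::set is a balanced binary tree, so each
// membership test/insert is O(log s); when n >> s collisions are rare,
// giving expected O(s log s) total work, independent of n.
std::set<int> tree_sample(int n, int s, std::mt19937& gen) {
    std::uniform_int_distribution<int> d(0, n - 1);
    std::set<int> chosen;
    while (static_cast<int>(chosen.size()) < s)
        chosen.insert(d(gen));  // a duplicate proposal is simply ignored
    return chosen;
}
```

As s approaches n the rejection rate grows, so this sketch (unlike a full tree-based scheme) should only be used when the sample is a small fraction of the population.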
The complexity of this algorithm is discussed in:
Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54.
http://dx.doi.org/10.1515/mcma.1999.5.1.39
If you find the algorithm useful, please cite this paper.
See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209.
DOI: 10.1080/00207168408803438
Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. International Journal of Computer Mathematics, 11(2): 127-140.
DOI: 10.1080/00207168208803304
In the latter paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R):
https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
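The pqR algorithm itself is not shown here, but a well-known example of the hash-table approach with O(s) expected time is Robert Floyd's algorithm, sketched below in C++ (the function name is a choice for illustration):

```cpp
#include <random>
#include <unordered_set>

// Robert Floyd's algorithm: exactly s iterations, one hash-set insert
// each, for expected O(s) time and O(s) space regardless of n.
// For j = n-s .. n-1, draw t uniformly from [0, j]; take t unless it is
// already chosen, in which case take j (which cannot have been chosen yet).
std::unordered_set<int> floyd_sample(int n, int s, std::mt19937& gen) {
    std::unordered_set<int> chosen;
    for (int j = n - s; j < n; ++j) {
        std::uniform_int_distribution<int> d(0, j);
        int t = d(gen);
        if (!chosen.insert(t).second)  // t was already in the sample
            chosen.insert(j);          // j is new by construction
    }
    return chosen;
}
```

Each iteration adds exactly one new element, so the result always has size s, and the procedure is known to produce a uniform sample.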
I wrote a survey of algorithms for sampling without replacement. I may be biased, but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.
Another algorithm for sampling without replacement is described here.
It is similar to the one described by John D. Cook in his answer, and also to Knuth's, but it has different assumptions: the population size is unknown, but the sample fits in memory. This one is called "Knuth's algorithm S".
Quoting the rosettacode article: