Algorithm for sampling without replacement?

Posted on 2024-07-09 10:46:59


I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.

Logic or pseudocode preferred.
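For illustration, the scheme described above (draw an index from 1-12, then from 1-11, and so on) amounts to a partial Fisher-Yates shuffle of the pointer array. A minimal C++ sketch of one replicate, with hypothetical names such as `pointers`, might look like this:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const int n = 12;                                  // total number of data values
    std::vector<int> pointers(n);
    std::iota(pointers.begin(), pointers.end(), 0);    // indices 0..11 into the data array

    std::mt19937 rng(std::random_device{}());

    const int replicates = 10000;
    for (int r = 0; r < replicates; ++r)
    {
        // One Fisher-Yates shuffle per replicate: O(n), without replacement by construction.
        std::shuffle(pointers.begin(), pointers.end(), rng);

        // pointers[0..2] -> group A, pointers[3..5] -> group B,
        // pointers[6..7] -> group C, pointers[8..11] -> group D
        // ... compute the clustering metric for this replicate here ...
    }
    return 0;
}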


Comments (6)

违心° 2024-07-16 10:46:59


Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumerical Algorithms.

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

少女情怀诗 2024-07-16 10:46:59


Working C++ code based on the answer by John D. Cook.

#include <random>
#include <vector>

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indices to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far

    // Seed once from std::random_device and reuse the engine; a default-constructed
    // engine recreated on every call would return the same sample each time.
    static std::default_random_engine re(std::random_device{}());
    std::uniform_real_distribution<double> dist(0,1);

    while (m < n)
    {
        double u = dist(re); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}
千年*琉璃梦 2024-07-16 10:46:59


Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulty understanding how it works, so I commented it extensively and included an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is how many of these remaining draws must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draws left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i
我一向站在原地 2024-07-16 10:46:59


When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.

When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is pointing to; must be filled with zeros
  rightv=integer(s) #right edge is pointing to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #current ordinal number
    v=1L #current vertex
    from=1L #how we got here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

The complexity of this algorithm is discussed in:
Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54.
http://dx.doi.org/10.1515/mcma.1999.5.1.39

If you find the algorithm useful, please cite this reference.

See also:
P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209.
DOI: 10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. International Journal of Computer Mathematics, 11(2): 127–140.
DOI: 10.1080/00207168208803304

In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R):
https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html
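As a rough illustration of the hash-table idea (this is a generic rejection scheme with expected O(s) time when s is much smaller than n, not the Teuhola-Nevalainen algorithm itself), a C++ sketch might be:

#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Draw until s distinct indices in [0, n) have been seen, tracking them in a hash set.
// Expected running time is O(s) while s is small relative to n; as s approaches n the
// rejection rate grows and the tree- or shuffle-based methods become preferable.
std::vector<uint32_t> hash_sample(uint32_t n, uint32_t s, std::mt19937& rng)
{
    std::uniform_int_distribution<uint32_t> dist(0, n - 1);
    std::unordered_set<uint32_t> seen;
    std::vector<uint32_t> sample;
    sample.reserve(s);
    while (sample.size() < s)
    {
        uint32_t x = dist(rng);
        if (seen.insert(x).second)   // true iff x has not been drawn before
            sample.push_back(x);
    }
    return sample;
}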

千秋岁 2024-07-16 10:46:59


I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.

#include <algorithm>  // for std::sort
#include <cstdint>

// randbelow(i) is assumed to be declared elsewhere (see the text above).
void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
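For context, a usage sketch is shown below. The `randbelow` helper here is a stand-in built on the standard library (it is not part of the original answer) and, in a single file, it must be defined before `cardchoose`:

#include <cstdint>
#include <iostream>
#include <random>

// Stand-in for the assumed randbelow(i): uniform integer in [0, i).
// In a single translation unit this definition must precede cardchoose.
static std::mt19937 rng(std::random_device{}());
uint32_t randbelow(uint32_t i)
{
    return std::uniform_int_distribution<uint32_t>(0, i - 1)(rng);
}

int main()
{
    uint32_t result[5];
    cardchoose(12, 5, result);   // choose 5 of 12 indices without replacement
    for (uint32_t x : result)
        std::cout << x << " ";
    std::cout << "\n";
    return 0;
}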
酷遇一生 2024-07-16 10:46:59


Another algorithm for sampling without replacement is described here.

It is similar to the one described by John D. Cook in his answer, and also comes from Knuth, but it works under a different assumption: the population size is unknown, yet the sample fits in memory. This one is called "Knuth's algorithm S".

Quoting the rosettacode article:

  1. Select the first n items as the sample as they become available;
  2. For the i-th item, where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
  3. Repeat #2 for any subsequent items.
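A minimal C++ sketch of this streaming procedure (the names are illustrative, and the population is modelled as a vector standing in for a stream) might look like this:

#include <iostream>
#include <random>
#include <vector>

// Sketch of the quoted procedure: keep the first n items, then let the k-th item
// (1-based, k > n) replace a uniformly chosen slot of the sample with probability n/k.
std::vector<int> reservoir_sample(const std::vector<int>& stream, std::size_t n, std::mt19937& rng)
{
    std::vector<int> sample;
    sample.reserve(n);
    for (std::size_t i = 0; i < stream.size(); ++i)   // i is the 0-based position, k = i + 1
    {
        if (i < n)
        {
            sample.push_back(stream[i]);              // step 1: take the first n items
        }
        else
        {
            std::uniform_int_distribution<std::size_t> dist(0, i);  // i + 1 equally likely values
            std::size_t j = dist(rng);
            if (j < n)                                // probability n / (i + 1), i.e. n/k
                sample[j] = stream[i];                // j is uniform over the n slots (1/n each)
        }
    }
    return sample;
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    std::vector<int> population(100);
    for (int i = 0; i < 100; ++i)
        population[i] = i;
    for (int x : reservoir_sample(population, 5, rng))
        std::cout << x << " ";
    std::cout << "\n";
    return 0;
}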