随机且唯一的子集生成

发布于 2024-08-04 13:46:14 字数 271 浏览 3 评论 0原文

假设我们有从 1 到 25 的数字,我们必须选择 15 个数字的集合。

如果我是对的,可能的集合是 3268760。

在这 3268760 个选项中,您必须生成 100000 个子集,

生成 100000 个唯一且随机的子集的最佳方法是什么?

有没有办法、算法来做到这一点?

如果没有,检测重复项的最佳选择是什么?

我计划在 PHP 上执行此操作,但通用解决方案就足够了, 任何不涉及太多“学术”(更实用)的参考都会对我有很大帮助。

Lets say we have numbers from 1 to 25 and we have to choose sets of 15 numbers.

The possible sets are, if i'm right 3268760.

Of those 3268760 options, you have to generate say 100000

What would be the best way to generate 100000 unique and random of that subsets?

Is there a way, an algorithm to do that?

If not, what would be the best option to detect duplicates?

I'm planning to do this on PHP but a general solution would be enough,
and any reference not to much 'academic' (more practical) would help me a lot.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

断肠人 2024-08-11 13:46:14

有一种方法可以生成随机的子集样本,保证不重复,使用 O(1) 存储,并且可以随时重新生成。首先,编写一个函数 根据给定的词法生成一个组合索引。其次,使用 第一个 Combin(n, m) 整数的伪随机排列,以随机顺序逐步遍历这些组合。只需将数字 0...100000 输入到排列中,使用排列的输出作为组合生成器的输入,然后处理生成的组合。

There is a way to generate a sample of the subsets that is random, guaranteed not to have duplicates, uses O(1) storage, and can be re-generated at any time. First, write a function to generate a combination given its lexical index. Second, use a pseudorandom permutation of the first Combin(n, m) integers to step through those combinations in a random order. Simply feed the numbers 0...100000 into the permutation, use the output of the permutation as input to the combination generator, and process the resulting combination.

时光是把杀猪刀 2024-08-11 13:46:14

这是基于 mjv 答案的 PHP 解决方案,这就是我的想法。如果运行完整 100k 组,您确实会看到很多冲突。然而,我很难设计一个系统来避免它们。相反,我们只是相当快地检查它们。

我会考虑更好的解决方案......在这台笔记本电脑上,我可以在 5 秒内完成 10k 组,在 20 秒内完成 20k 组。 100k需要几分钟。

这些集合表示为(32 位)整数。

<?PHP
    /* (c) 2009 tim - anyone who finds a use for this is very welcome to use it with no restrictions unless they're making a weapon */

    //how many sets shall we generate?
    $gNumSets = 1000;

    //keep track of collisions, just for fun.
    $gCollisions = 0;

    $starttime = time();

    /**
     * Generate and return an integer with exactly 15 of the lower 25 bits set (1) and the other 10 unset (0)
     */ 
    function genSetHash(){
      $hash = pow(2,25)-1;

      $used = array();

      for($i=0;$i<10;){

        //pick a bit to turn off
        $bit = rand(0,24);

        if (! in_array($bit,$used)){
          $hash =  ( $hash & ~pow(2,$bit) );
          $i++;  
          $used[] = $bit;  
        }
      }
      return  $hash;
    }

    //we store our solution hashes in here.  
    $solutions = array();

    //generate a bunch of solutions.
    for($i=0;$i<$gNumSets;){
      $hash = genSetHash(); 

      //ensure no collisions
      if (! in_array($hash,$solutions)){
        $solutions[] = $hash;
        //brag a little.
        echo("Generated $i random sets in " . (time()-$starttime) . " seconds.\n");
        $i++;
      }else { 
        //there was a collision. There will generally be more the longer the process runs.
        echo "thud.\n"; 
        $gCollisions++;
      }
    }

    // okay, we're done with the hard work.  $solutions contains a bunch of
    // unique, random, ints in the right range.  Everything from here on out
    // is just output.

    //takes an integer with 25 significant digits, and returns an array of 15 numbers between 1 and 25
    function hash2set($hash){
      $set = array();
      for($i=0;$i<24;$i++){  
        if ($hash & pow(2,$i)){
          $set[] = $i+1;
        }
      }
      return $set;
    }

    //pretty-print our sets.
    function formatSet($set){
      return "[ " . implode(',',$set) . ']';
    }

    //if we wanted to print them, 
    foreach($solutions as $hash){
      echo formatSet(hash2set($hash)) . "\n";
    }

    echo("Generated $gNumSets unique random sets in " . (time()-$starttime) . " seconds.\n");

    echo "\n\nDone.  $gCollisions collisions.\n";

我认为这一切都是正确的,但已经晚了,我已经喝了几瓶非常好的啤酒了。

Here's a solution in PHP based on mjv's answer, which is how I was thinking about it. If you run it for a full 100k sets, you do indeed see a lot of collisions. However, I'm hard pressed to devise a system to avoid them. Instead, we just check them fairly quickly.

I'll think about better solutions ... on this laptop, I can do 10k sets in 5 seconds, 20k sets in under 20 seconds. 100k takes several minutes.

The sets are represented as (32-bit) ints.

<?PHP
    /* (c) 2009 tim - anyone who finds a use for this is very welcome to use it with no restrictions unless they're making a weapon */

    //how many sets shall we generate?
    $gNumSets = 1000;

    //keep track of collisions, just for fun.
    $gCollisions = 0;

    $starttime = time();

    /**
     * Generate and return an integer with exactly 15 of the lower 25 bits set (1) and the other 10 unset (0)
     */ 
    function genSetHash(){
      $hash = pow(2,25)-1;

      $used = array();

      for($i=0;$i<10;){

        //pick a bit to turn off
        $bit = rand(0,24);

        if (! in_array($bit,$used)){
          $hash =  ( $hash & ~pow(2,$bit) );
          $i++;  
          $used[] = $bit;  
        }
      }
      return  $hash;
    }

    //we store our solution hashes in here.  
    $solutions = array();

    //generate a bunch of solutions.
    for($i=0;$i<$gNumSets;){
      $hash = genSetHash(); 

      //ensure no collisions
      if (! in_array($hash,$solutions)){
        $solutions[] = $hash;
        //brag a little.
        echo("Generated $i random sets in " . (time()-$starttime) . " seconds.\n");
        $i++;
      }else { 
        //there was a collision. There will generally be more the longer the process runs.
        echo "thud.\n"; 
        $gCollisions++;
      }
    }

    // okay, we're done with the hard work.  $solutions contains a bunch of
    // unique, random, ints in the right range.  Everything from here on out
    // is just output.

    //takes an integer with 25 significant digits, and returns an array of 15 numbers between 1 and 25
    function hash2set($hash){
      $set = array();
      for($i=0;$i<24;$i++){  
        if ($hash & pow(2,$i)){
          $set[] = $i+1;
        }
      }
      return $set;
    }

    //pretty-print our sets.
    function formatSet($set){
      return "[ " . implode(',',$set) . ']';
    }

    //if we wanted to print them, 
    foreach($solutions as $hash){
      echo formatSet(hash2set($hash)) . "\n";
    }

    echo("Generated $gNumSets unique random sets in " . (time()-$starttime) . " seconds.\n");

    echo "\n\nDone.  $gCollisions collisions.\n";

I think it's all correct, but it's late, and I've been enjoying several very nice bottles of beer.

倾城泪 2024-08-11 13:46:14

它们必须是真正随机的吗?还是看似随机?

选择:生成一个包含所有 25 个元素的集合 - 使用 Fisher-Yates / Knuth 洗牌“洗牌”前 15 个元素,然后检查您之前是否见过前 15 个元素的排列。如果是这样,请忽略并重试。

重复项:您有 25 个存在或不存在的值 - 这可以简单地散列为整数值(如果第一个元素存在,则添加 2^0,如果第二个元素存在,则添加 2^1 等 - 它可以是直接表示为 25 位数字),因此您可以轻松检查是否已经看过它。

您会遇到相当多的冲突,但如果它不是性能关键片段,那么它可能是可行的。

Do they have to be truly random? Or seemingly random?

Selection: generate a set with all 25 - "shuffle" the first 15 elements using Fisher-Yates / the Knuth shuffle, and then check if you've seen that permutation of the first 15 elements before. If so, disregard, and retry.

Duplicates: You have 25 values that are there or not - this can be trivially hashed to an integer value (if the 1st element is present, add 2^0, if the second is, add 2^1, etc. - it can be directly represented as a 25 bit number), so you can check easily if you've seen it already.

You'll get a fair bit of collisions, but if it's not a performance critical snippet, it might be doable.

江南月 2024-08-11 13:46:14

您环境的随机数生成器 (RNG) 将为您提供均匀分布在特定范围内的随机数。这种类型的分布通常是所需要的,例如,如果您的子集模拟彩票抽奖,但重要的是要提及这一事实,以防您正在建模,例如在中学场地上发现的人的年龄......

鉴于此 RNG您可以“绘制”10(或 15,请参阅下文)1 到 25 之间的数字。这可能需要您乘以(并舍入)生成器生成的随机数,并且忽略高于 25 的数字(即再次绘制) ),具体取决于与 RNG 相关的确切 API,但在给定范围内获取绘图也是微不足道的。当数字再次出现时,您还需要重新抽奖。

我建议您只获取 10 个数字,因为可以从 1-25 完整序列中删除这些数字,以生成一组 15 个数字。换句话说,放入的画 15 与取出的画 10 是相同的...

接下来您需要断言集合的唯一性。您可以使用散列来唯一地标识每个集合,而不是存储整个集合。这应该少于 25 位,因此可以存储在 32 位整数上。然后,您需要高效存储最多 100,000 个这些值;除非你想将其存储在数据库中。

对于从所有可能的集合中取出 100,000 个集合的唯一性问题,碰撞的概率似乎相对较低。编辑:哎呀...我很乐观...这个概率并没有那么低,大约有1.5%的机会在绘制第50,000次之后开始发生碰撞,将会有相当多的碰撞,足以保证系统将它们排除...

The random number generator (RNG) of your environment will supply you random numbers that are evenly distributed in a particular range. This type of distribution is often what is needed, say if your subset simulate lottery drawings, but it is important to mention this fact in case your are modeling say the age of people found on the grounds of a middle school...

Given this RNG you can "draw" 10 (or 15, read below) numbers between 1 and 25. This may require that you multiply (and round) the random number produced by the generator, and that you ignore numbers that are above 25 (i.e. draw again), depending on the exact API associated with the RNG, but again getting a drawing in a given range is trivial. You will also need to re-draw when a number comes up again.

I suggest you get 10 numbers only, as these can be removed from the 1-25 complete sequence to produce a set of 15. In other words drawing 15 to put in is the same drawing 10 to take out...

Next you need to assert the uniqueness of the sets. Rather than storing the whole set, you can use a hash to identify each set uniquely. This should take fewer that 25 bits, so can be stored on a 32 bits integer. You then need to have an efficient storage for up to 100,000 of these values; unless you want to store this in a database.

On this question of uniqueness of 100,000 sets taken out of all the possible sets, the probability of a collision seems relatively low. Edit: Oops... I was optimistic... This probability is not so low, with about 1.5% chance of a collision starting after drawing the 50,000th, there will be quite a few collisions, enough to warrant a system to exclude them...

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文