What is the fastest way to generate a random sequence from a list of data?

Posted 2024-09-10 15:51:29, 512 characters, 8 views, 0 comments

Let's say that I have a list of data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} where n = 10 elements

I'd like to randomly choose k elements of this set to form a sublist, say k = 5.

In that case, I could end up with a sublist that looks like {9, 3, 5, 2, 7}

I could accomplish this by:

  • Randomly determining an offset within the list, between 0 and the current size of the list minus 1
  • Appending that element to my sublist
  • Erasing that element from the original list
  • Repeat until the desired size is found

The problem is that as the original list grows, both locating the offset and erasing the element take longer, and for any sufficiently large list (say, over 1,000,000 elements) the algorithm takes quite a long time.

Is there a faster way to generate a random sequence from a given list of data? The implementation of the random number generator should be set aside for this problem; the focus is on how the RNG result is used in a proposed algorithm.

Any thoughts?

Right now I'm using the C++ STL list (std::list).

Comments (10)

为你鎻心 2024-09-17 15:51:30

A minimal example using OutputIterators and std::random_shuffle. Notice that the algorithm will modify your original input, so it could be reasonable to make a copy before you call the function.

#include <iostream>
#include <algorithm>
#include <vector>
#include <iterator>

template<class It, class OutIt>
void take_random_n(It begin, It end, OutIt out, size_t n) {
  std::random_shuffle(begin, end);
  It end2 = begin;
  std::advance(end2, n);
  std::copy(begin, end2, out);
}

int main() {
  std::vector<int> a;
  int b[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  take_random_n(b, b + 10, std::back_inserter(a), 4);
  for(std::vector<int>::iterator it = a.begin(); it != a.end(); ++it)
    std::cout << *it << " ";
}
寄风 2024-09-17 15:51:30

Or you could accomplish this by:

  • Randomly determining an offset within the list, between 0 and the size of the list minus 1.
  • Appending that element to your sublist.
  • Repeating until the sublist is probably long enough to contain the right number of distinct elements. For example, if you are choosing 10 out of 1,000,000 elements, a sublist of 10 is probably long enough. You don't need to be hyper-accurate in calculating how many extra elements to choose.
  • Checking that all elements in the sublist are different. If not, delete the duplicates. If your sublist is now too short, choose some more from the main list. If not, you're done.

I'm not sure why you want to delete the chosen elements from the main list, but if that is essential you could do it after constructing the sublist.

And I haven't a clue how the performance of this approach compares with that of the suggested random_shuffle on a list of 10^6 elements.

你怎么敢 2024-09-17 15:51:30

Shuffle the list, then take the first (or last) k elements. If you use an O(n) algorithm like the Fisher-Yates shuffle, then the whole process is O(n).

不念旧人 2024-09-17 15:51:30

You could shuffle it with std::random_shuffle and then just copy the first however many elements you want into a new list.

苍景流年 2024-09-17 15:51:30

Shuffle your array using some algorithm.
Then you can take your random elements from the beginning of the array.

吖咩 2024-09-17 15:51:30

Assign a random number to each entry in your list, then sort the list by random number. Pick off the first n entries you want.

悲念泪 2024-09-17 15:51:30

Most answers propose to shuffle the initial container. If you don't want it to be modified, you can still use this approach, but you first need to copy the container. The solution of @pmr (which is nice because he makes it into a function) would then become:

template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator  last, 
                   Size          n,     OutputIterator result)
{
    typedef typename std::iterator_traits<InputIterator>::value_type value_type;

    std::vector<value_type> shufflingVec(first, last);

    std::random_shuffle(shufflingVec.begin(), shufflingVec.end());

    std::copy(shufflingVec.begin(), shufflingVec.begin() + n, result);
}

However, copying the entire container can be quite expensive if the elements contained are heavy and take some time to copy. In this case, you can be better off shuffling a list of indexes:

// Despite the parameter name, this needs at least forward iterators,
// because the range is traversed more than once.
template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator  last, 
                   Size          n,     OutputIterator result)
{
    typedef typename 
        std::iterator_traits<InputIterator>::difference_type difference_type;

    difference_type size = std::distance(first, last);

    // The vector holds indexes, not elements, so its value type is size_t.
    std::vector<size_t> indexesVec(
        boost::counting_iterator<size_t>(0),
        boost::counting_iterator<size_t>(size));

    // counting_iterator generates incrementing numbers. Easy to implement if you
    // can't use Boost

    std::random_shuffle(indexesVec.begin(), indexesVec.end());

    for (Size i = 0 ; i < n ; ++i)
    {
        // std::advance returns void and moves its argument, so advance a copy.
        InputIterator it = first;
        std::advance(it, indexesVec[i]);
        *result++ = *it;
    }
}

// Disclaimer: I have not tested the code above!

You'll notice that the latter solution will perform very differently depending on the kind of iterators you use: with random access iterators (like pointers or vector<T>::iterator), it will be ok, but with other types of iterators, the use of std::distance and the numerous calls to std::advance can induce quite an overhead.

墟烟 2024-09-17 15:51:30

My 2 cents (using the STL only and needing at most forward iterators):

//-----------------------------------------------------------------------------
#include <cstdlib>
#include <cstddef>
//-----------------------------------------------------------------------------
#include <iostream>
#include <list>
#include <iterator>
#include <algorithm>
//-----------------------------------------------------------------------------
// random generator
template< typename DiffType >
struct RandomlyRandom{
  DiffType operator()( DiffType i ){
    return std::rand() % i;
  }
};
//-----------------------------------------------------------------------------
// we'll have two iterators:
//  - the first starts at the beginning of the range
// and moves one element at a time for n times
//  - the second starts at random in the middle of the range
// and will move a random number of elements inside the range
//
// then we swap their values
//
// requires a non-empty range and n <= std::distance( begin, end )
// note: every swap target is drawn from the whole range, so unlike a
// Fisher-Yates shuffle the result is not guaranteed to be uniform
template< typename FwdIter, typename Fn >
void random_shuffle_n( FwdIter begin, FwdIter end, Fn& Func, size_t n ){
  typedef typename std::iterator_traits<FwdIter>::difference_type difference_type;

  FwdIter first = begin;
  FwdIter second = begin;

  difference_type dist  = std::distance( begin, end );
  difference_type offset = Func( dist ) % dist;
  difference_type index = offset;
  std::advance( second, offset ); // try to put some distance between first & second

  // while rather than do/while, so that exactly n swaps are made
  while( n-- > 0 ){
    offset = Func( dist ) % dist;
    index += offset;
    if( index >= dist ){
      second = begin;
      index = offset = index % dist;
    }
    std::advance( second, offset );

    std::swap( *first++, *second );
  }
}
//-----------------------------------------------------------------------------
int main( int argc, char* argv[] ){
  int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  std::list< int > lst( arr, arr + sizeof( arr ) / sizeof( arr[ 0 ] ) );

  std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) ); 
  std::cout << std::endl;
  RandomlyRandom< std::list< int >::difference_type > rand;

  for( int i = 0; i < 100;  i++ ){
    random_shuffle_n( lst.begin(), lst.end(), rand, 5 );
    std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) ); 
    std::cout << std::endl;
  }

  return 0;
}
//-----------------------------------------------------------------------------

吐个泡泡 2024-09-17 15:51:29

I would use random_shuffle. You can change the generator by supplying a third parameter.

It requires random access iterators, so you can either switch to a std::vector (which is generally far superior and preferred over std::list, arguably the worse container), or just operate on some array. I'll demonstrate both:

int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::random_shuffle(data, data + 10); 

// or

std::vector<int> data; // populate it
std::random_shuffle(data.begin(), data.end());

Now everything is in random order; just treat the first k elements as your subset:

// now treat data[0] through data[k - 1] as your random subset, or:
std::vector<int> subset(data, data + k); // array version

// or
data.resize(k); // shrink vector

Note that in another question, Jerry shares an excellent way of doing what you want.

九厘米的零° 2024-09-17 15:51:29

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

Look under Examples > Modern method

You don't need to shuffle your entire list. Stopping after the first k swaps is O(k), which is better than O(n).
