What is the fastest way to generate a random sequence from a list of data?

Posted on 2024-09-10 15:51:29

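Suppose I have a list of data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, where n = 10 elements.

I'd like to randomly choose k elements of this set to form a sublist, say k = 5.

In that case, I could end up with a sublist that looks like {9, 3, 5, 2, 7}.

I could accomplish this by:

Randomly determining an offset within the list, between 0 and the current size of the list minus 1
Appending that element to my sublist
Erasing that element from the original list
Repeating until the desired size is reached

The problem with this is that as the original list grows, the offset and deletion time grows as well, and for any significantly large list (say over 1,000,000 elements), it takes quite a long time to perform this algorithm.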
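For reference, the four steps described above might look like the following on a std::list. This is only a sketch of the approach as described; the name naive_take, the by-value copy of the list, and the use of <random> are assumptions made for illustration, not part of the question.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <random>
#include <vector>

// Naive selection as described in the question: pick an offset, copy the
// element, erase it from the source, repeat k times.
// `naive_take` is an illustrative name. The list is taken by value so the
// caller's list is left untouched.
template <class T, class Rng>
std::vector<T> naive_take(std::list<T> source, std::size_t k, Rng& rng) {
    assert(k <= source.size());
    std::vector<T> picked;
    picked.reserve(k);
    while (picked.size() < k) {
        // Offset in [0, size - 1].
        std::uniform_int_distribution<std::size_t> dist(0, source.size() - 1);
        typename std::list<T>::iterator it = source.begin();
        std::advance(it, dist(rng));  // linear walk on a linked list
        picked.push_back(*it);
        source.erase(it);             // constant time once the node is found
    }
    return picked;
}
```

Each pick walks the list from the front, so choosing k elements out of n costs on the order of k * n steps, which is the growth the question is concerned about.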

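Is there a faster way to generate a random sequence from a given list of data? The implementation of the random number generator should be set aside for this problem; the focus is on how the RNG result would be used in a proposed algorithm.

Any thoughts?

Right now I'm using the C++ STL list.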

为你鎻心 2024-09-17 15:51:30

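A minimal example using OutputIterators and std::random_shuffle. Note that the algorithm modifies your original input, so it may be reasonable to make a copy before you call the function.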

#include <iostream>
#include <algorithm>
#include <vector>
#include <iterator>

template<class It, class OutIt>
void take_random_n(It begin, It end, OutIt out, size_t n) {
  std::random_shuffle(begin, end);
  It end2 = begin;
  std::advance(end2, n);
  std::copy(begin, end2, out);
}

int main() {
  std::vector<int> a;
  int b[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  take_random_n(b, b + 10, std::back_inserter(a), 4);
  for(std::vector<int>::iterator it = a.begin(); it != a.end(); ++it)
    std::cout << *it << " ";
}
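std::random_shuffle was deprecated in C++14 and removed in C++17, so on a newer compiler the same idea needs std::shuffle with an explicit random engine. The following is a sketch under that assumption; the name take_random_n_cpp11 and the extra generator parameter are hypothetical and not part of the answer above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Same idea as take_random_n, but with std::shuffle and a caller-supplied
// engine. `take_random_n_cpp11` is a hypothetical name for this sketch.
// Like the original, it reorders the input range in place.
template <class RandomIt, class OutIt, class Urbg>
OutIt take_random_n_cpp11(RandomIt first, RandomIt last, OutIt out,
                          std::size_t n, Urbg&& gen) {
    assert(n <= static_cast<std::size_t>(std::distance(first, last)));
    std::shuffle(first, last, gen);     // shuffle the whole input
    return std::copy_n(first, n, out);  // emit the first n elements
}
```

A caller would typically construct one engine, for example std::mt19937 seeded from std::random_device, and pass it to every call.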

寄风 2024-09-17 15:51:30

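Alternatively, you could do it this way:

Randomly determine an offset within the list, between 0 and the current size of the list minus 1.
Append that element to your sublist.
Repeat until the sublist is probably long enough to contain the right number of elements. For example, if you are choosing 10 out of 1,000,000 elements, a sublist of 10 is probably long enough. You don't need to be very accurate in calculating how many extra elements you have to choose.
Now check that all the elements in the sublist are different. If not, delete the duplicates. If your sublist is now too short, choose some more from the main list. If not, you're done.

I'm not sure why you want to delete the chosen elements from the main list, but if that is essential you could do it after constructing the sublist.

I have no idea how the performance of this approach compares with that of the suggested random_shuffle on a list of 10^6 elements.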
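A possible sketch of this draw-and-deduplicate idea follows. It differs from the description in two ways that are assumptions of this sketch: duplicates are rejected while drawing rather than in a separate pass, and the check is made on offsets rather than values, so equal values at different positions are not mistaken for repeats. The name sample_by_rejection and the std::vector source are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Draw offsets with replacement and discard repeats until k distinct
// offsets have been collected. The source is never modified.
// `sample_by_rejection` is an illustrative name, not from the answer.
template <class T, class Rng>
std::vector<T> sample_by_rejection(const std::vector<T>& source,
                                   std::size_t k, Rng& rng) {
    assert(k <= source.size());
    std::vector<T> picked;
    if (k == 0) return picked;
    picked.reserve(k);
    std::uniform_int_distribution<std::size_t> dist(0, source.size() - 1);
    std::unordered_set<std::size_t> seen;  // offsets already chosen
    while (picked.size() < k) {
        std::size_t i = dist(rng);
        if (seen.insert(i).second) {       // true only for a new offset
            picked.push_back(source[i]);
        }
    }
    return picked;
}
```

When k is small relative to n, almost every draw is new and the loop runs close to k times. As k approaches n, repeats become frequent and a shuffle-based approach is likely the better choice.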


你怎么敢 2024-09-17 15:51:30

Shuffle the list, then take the first (or last) k elements. If you use an O(n) algorithm such as the Fisher-Yates shuffle, the whole process is O(n).


不念旧人 2024-09-17 15:51:30

You can use std::random_shuffle to shuffle it, then simply copy the first however many elements you want into a new list.


苍景流年 2024-09-17 15:51:30

Shuffle the array using some algorithm.
Then you can read random elements off the beginning of the array.


吖咩 2024-09-17 15:51:30

Assign a random number to each entry in the list, then sort the list by those random numbers. Take as many entries from the front as you need.
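One way this could look in code is sketched below. The name sample_by_random_keys, the std::vector source, and the choice of double keys are assumptions for illustration; the sketch sorts (key, index) pairs rather than the elements themselves so that the source is left unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Tag every position with a random key, sort by key, and return the
// elements at the first k positions of the sorted order.
// `sample_by_random_keys` is an illustrative name, not from the answer.
template <class T, class Rng>
std::vector<T> sample_by_random_keys(const std::vector<T>& source,
                                     std::size_t k, Rng& rng) {
    assert(k <= source.size());
    std::vector<std::pair<double, std::size_t> > keyed;
    keyed.reserve(source.size());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (std::size_t i = 0; i < source.size(); ++i) {
        keyed.push_back(std::make_pair(dist(rng), i));
    }
    // Pairs compare by key first; equal keys fall back to the index.
    std::sort(keyed.begin(), keyed.end());
    std::vector<T> picked;
    picked.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        picked.push_back(source[keyed[i].second]);
    }
    return picked;
}
```

The sort makes this O(n log n), so it is slower than an O(n) shuffle for large inputs. Replacing std::sort with std::partial_sort over the first k positions would reduce the sorting work when k is much smaller than n.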


悲念泪 2024-09-17 15:51:30

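Most answers propose shuffling the initial container. If you don't want it to be modified, you can still use this approach, but you first need to copy the container. The solution of @pmr (which is nice because he makes it into a function) would then become: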

#include <algorithm>
#include <iterator>
#include <vector>

template <typename InputIterator, typename Size, typename OutputIterator>
void take_random_n(InputIterator first, InputIterator  last, 
                   Size          n,     OutputIterator result)
{
    typedef typename std::iterator_traits<InputIterator>::value_type value_type;

    // Copy the input so the caller's container is left untouched.
    std::vector<value_type> shufflingVec(first, last);

    std::random_shuffle(shufflingVec.begin(), shufflingVec.end());

    // Emit the first n elements of the shuffled copy.
    std::copy(shufflingVec.begin(), shufflingVec.begin() + n, result);
}

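However, copying the entire container can be quite expensive if the elements are heavy and take some time to copy. In that case, you may be better off shuffling a list of indexes: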

#include <algorithm>
#include <iterator>
#include <vector>
#include <boost/iterator/counting_iterator.hpp>

// The range is traversed several times, so forward iterators are needed.
template <typename ForwardIterator, typename Size, typename OutputIterator>
void take_random_n(ForwardIterator first, ForwardIterator last,
                   Size            n,     OutputIterator  result)
{
    typedef typename
        std::iterator_traits<ForwardIterator>::difference_type difference_type;

    difference_type size = std::distance(first, last);

    // The vector holds indexes, not elements, so its element type is
    // difference_type rather than value_type.
    std::vector<difference_type> indexesVec(
        boost::counting_iterator<difference_type>(0),
        boost::counting_iterator<difference_type>(size));

    // counting_iterator generates incrementing numbers. Easy to implement if you
    // can't use Boost

    std::random_shuffle(indexesVec.begin(), indexesVec.end());

    for (Size i = 0 ; i < n ; ++i)
    {
        // std::advance returns void and moves its argument, so advance a copy.
        ForwardIterator it = first;
        std::advance(it, indexesVec[i]);
        *result++ = *it;
    }
}

// Disclaimer: I have not tested the code above!

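You'll notice that the latter solution performs very differently depending on the kind of iterators you use: with random access iterators (such as pointers or vector<T>::iterator) it is fine, but with other kinds of iterators, the use of std::distance and the numerous calls to std::advance can add considerable overhead.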


墟烟 2024-09-17 15:51:30

My 2 cents (using the STL only and needing at most forward iterators):

//-----------------------------------------------------------------------------
#include <cstdlib>
//-----------------------------------------------------------------------------
#include <iostream>
#include <list>
#include <iterator>
#include <algorithm>
//-----------------------------------------------------------------------------
// random generator
template< typename DiffType >
struct RandomlyRandom{
  DiffType operator()( DiffType i ){
    return std::rand() % i;
  }
};
//-----------------------------------------------------------------------------
// we'll have two iterators:
//  - the first starts at the beginning of the range
// and moves one element at a time for n times
//  - the second starts at random in the middle of the range
// and will move a random number of elements inside the range
//
// then we swap their values
template< typename FwdIter, typename Fn >
void random_shuffle_n( FwdIter begin, FwdIter end, Fn& Func, size_t n ){
  typedef typename std::iterator_traits<FwdIter>::difference_type difference_type;

  FwdIter first = begin;
  FwdIter second = begin;

  difference_type dist = std::distance( begin, end );
  if( dist == 0 ) return; // nothing to shuffle, and avoids a modulo by zero

  difference_type offset = Func( dist ) % dist;
  difference_type index = offset;
  std::advance( second, offset ); // try to put some distance between first & second

  // swap exactly n times (the original do/while ran n + 1 times)
  // and stop early if first reaches the end of the range
  for( ; n > 0 && first != end; --n ){
    offset = Func( dist ) % dist;
    index += offset;
    if( index >= dist ){
      // wrap around to the start of the range
      second = begin;
      index = offset = index % dist;
    }
    std::advance( second, offset );

    std::swap( *first++, *second );
  }
}
//-----------------------------------------------------------------------------
int main(){
  int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  std::list< int > lst( arr, arr + sizeof( arr ) / sizeof( arr[ 0 ] ) );

  std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) );
  std::cout << std::endl;
  RandomlyRandom< std::list< int >::difference_type > gen;

  for( int i = 0; i < 100; i++ ){
    // after each call the first 5 elements of lst are the random picks
    random_shuffle_n( lst.begin(), lst.end(), gen, 5 );
    std::copy( lst.begin(), lst.end(), std::ostream_iterator< int >( std::cout, " " ) );
    std::cout << std::endl;
  }

  return 0;
}
//-----------------------------------------------------------------------------
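If C++17 is available, the standard library's std::sample also works with forward iterators such as those of std::list, and it leaves the source untouched. The sketch below is offered as an alternative technique, not as what the answer above implements; the wrapper name sample_from_list is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <random>
#include <vector>

// Requires C++17. std::sample selects k elements from the range, each
// element being equally likely to appear, without modifying the range.
// `sample_from_list` is a hypothetical wrapper name for this sketch.
template <class T, class Rng>
std::vector<T> sample_from_list(const std::list<T>& source,
                                std::size_t k, Rng& rng) {
    assert(k <= source.size());
    std::vector<T> picked;
    picked.reserve(k);
    std::sample(source.begin(), source.end(),
                std::back_inserter(picked), k, rng);
    return picked;
}
```

With forward iterators as the source, std::sample keeps the selected elements in their original relative order, so if the sublist itself must be in random order the result would need a std::shuffle afterwards.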


吐个泡泡 2024-09-17 15:51:29

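I would use random_shuffle. You can change the generator by supplying a third parameter.

It requires random access iterators, so you can either switch to a std::vector (which is generally far superior and preferred over std::list, arguably the worse container), or just operate on an array. I'll demonstrate both:

int data[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
std::random_shuffle(data, data + 10); 

// or

std::vector<int> data; // populate it
std::random_shuffle(data.begin(), data.end());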

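Now everything is in random order; just treat the first k elements as your subset:

// now treat data[0] through data[k - 1] as your random subset, or
// (array version) copy them out:
std::vector<int> subset(data, data + k);

// or (vector version)
data.resize(k); // shrink vector

Note that in another question, Jerry shares an excellent way of doing what you want.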


九厘米的零° 2024-09-17 15:51:29

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

Look under Examples > Modern method.

You don't need to shuffle your entire list. O(k) (better than O(n))
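A minimal sketch of stopping the modern Fisher-Yates shuffle after k steps is shown below. It assumes a random access container (std::vector here) so that each swap is constant time; the name partial_shuffle is illustrative and not taken from the linked article.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Partial Fisher-Yates: step i swaps position i with a position chosen
// uniformly from [i, size - 1]. Stopping after k steps leaves a uniformly
// random selection of k elements in data[0] .. data[k - 1].
// `partial_shuffle` is an illustrative name; it reorders `data` in place.
template <class T, class Rng>
void partial_shuffle(std::vector<T>& data, std::size_t k, Rng& rng) {
    assert(k <= data.size());
    for (std::size_t i = 0; i < k; ++i) {
        // Pick j uniformly from the not-yet-fixed tail [i, size - 1].
        std::uniform_int_distribution<std::size_t> dist(i, data.size() - 1);
        std::swap(data[i], data[dist(rng)]);
    }
}
```

Only k swaps are performed, which is where the O(k) bound comes from. On a std::list the same swaps would each require a linear walk, so the bound holds only with random access.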
