What is the fastest way to generate a random sequence from a list of data?
Let's say that I have a list of data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} where n = 10 elements
I'd like to randomly choose k elements of this set to form a sublist, say k = 5.
In that case, I could end up with a sublist that looks like {9, 3, 5, 2, 7}
I could accomplish this by:
- Randomly determining an offset within the list, between 0 and the current size of the list minus 1
- Appending that element to my sublist
- Erasing that element from the original list
- Repeat until the desired size is found
The problem with this is that as the original list grows, the offset and deletion times grow as well, and for any significantly large list (say over 1,000,000 elements), it takes quite a long time to perform this algorithm.
Is there a faster way to generate a random sequence from a list of given data? The implementation of the random number generator should be set aside for this problem; the focus is on how the RNG result is used in a proposed algorithm.
Any thoughts?
Right now I'm using the C++ STL list
A minimal example using OutputIterators and std::random_shuffle. Notice that the algorithm will modify your original input, so it could be reasonable to make a copy before you call the function.
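The code that accompanied this answer is not reproduced here. As a rough sketch of the idea, and not the poster's actual code, the following assumes C++11 or later and uses std::shuffle, since std::random_shuffle was deprecated in C++14 and removed in C++17. The names shuffle_and_take and demo_shuffle_and_take are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Hypothetical helper: shuffles [first, last) in place, then copies the
// first k elements to `out`. The input range is reordered by the call.
// Requires random access iterators because std::shuffle does.
template <typename RandomIt, typename OutputIt, typename URBG>
OutputIt shuffle_and_take(RandomIt first, RandomIt last, std::size_t k,
                          OutputIt out, URBG& rng) {
    assert(k <= static_cast<std::size_t>(std::distance(first, last)));
    std::shuffle(first, last, rng);
    return std::copy_n(first, k, out);
}

// Example use with the data from the question (n = 10, k = 5).
inline std::vector<int> demo_shuffle_and_take(std::mt19937& rng) {
    std::vector<int> data{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};  // reordered below
    std::vector<int> sub;
    shuffle_and_take(data.begin(), data.end(), 5, std::back_inserter(sub), rng);
    return sub;
}
```

Because std::shuffle needs random access iterators, a std::list would first have to be copied into a std::vector, which also takes care of preserving the original.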
Or you could accomplish this by:
- [...] the list, between 0 and the current size of the list.
- [...] sublist.
I'm not sure why you want to delete the chosen elements from the main list, but if that is essential you could do it after constructing the sublist.
I haven't a clue how the performance of this approach compares with that of the suggested random_shuffle on a list of 10^6 elements.
Shuffle the list, then take the first (or last) k elements. If you use an O(n) algorithm like the Fisher-Yates shuffle, then the whole process is O(n).
You could shuffle it with std::random_shuffle and then just copy the first k elements (however many you want) into a new list.
Shuffle your array using some algorithm.
Then you can take random elements from the beginning of the array.
Assign a random number to each entry in your list, then sort the list by random number. Pick off the first k entries you want.
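One way this could look in C++ is sketched below. It is only an illustration under assumptions: the data is held in a std::vector, the keys are doubles drawn from a uniform distribution, and the name sample_by_random_keys is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: tags every element with a random key, sorts by key,
// and returns the elements behind the k smallest keys.
// O(n log n) time and O(n) extra space; `data` itself is not modified.
template <typename T, typename URBG>
std::vector<T> sample_by_random_keys(const std::vector<T>& data, std::size_t k,
                                     URBG& rng) {
    assert(k <= data.size());
    std::uniform_real_distribution<double> key(0.0, 1.0);

    std::vector<std::pair<double, std::size_t>> tagged;  // (random key, index)
    tagged.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        tagged.emplace_back(key(rng), i);
    }
    std::sort(tagged.begin(), tagged.end());  // orders by key first

    std::vector<T> out;
    out.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        out.push_back(data[tagged[i].second]);
    }
    return out;
}
```

The sort makes this O(n log n), so it is slower than an O(n) shuffle; replacing std::sort with std::partial_sort over the first k positions would reduce the work when k is much smaller than n.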
Most answers propose to shuffle the initial container. If you don't want it to be modified, you can still use this approach, but you first need to copy the container. The solution of @pmr (which is nice because he makes it into a function) would then become:
However, copying the entire container can be quite expensive if the elements contained are heavy and take some time to copy. In this case, you may be better off shuffling a list of indexes:
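The original code for this step is not available here. A minimal sketch of index shuffling, assuming C++11 and using the hypothetical name sample_via_indexes, might look like this:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: leaves [first, last) untouched, shuffles a vector of
// indexes instead, and copies the elements at the first k shuffled indexes.
template <typename ForwardIt, typename OutputIt, typename URBG>
OutputIt sample_via_indexes(ForwardIt first, ForwardIt last, std::size_t k,
                            OutputIt out, URBG& rng) {
    using diff_t = typename std::iterator_traits<ForwardIt>::difference_type;
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    assert(k <= n);

    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, ..., n-1
    std::shuffle(idx.begin(), idx.end(), rng);

    for (std::size_t i = 0; i < k; ++i) {
        // std::next is O(1) for random access iterators and linear in the
        // index for anything weaker (for example std::list iterators).
        *out++ = *std::next(first, static_cast<diff_t>(idx[i]));
    }
    return out;
}
```

Only k elements are ever copied; the cost moves to the index vector, which takes O(n) space regardless of how heavy the elements are.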
You'll notice that the latter solution will perform very differently depending on the kind of iterators you use: with random access iterators (like pointers or vector<T>::iterator), it will be ok, but with other types of iterators, the use of std::distance and the numerous calls to std::advance can induce quite an overhead.
My 2 cents (using stl only & needing at most forward iterators):
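The poster's code is missing from this copy. One known technique that fits the description (standard library only, forward iterators at most) is selection sampling, sometimes called Knuth's Algorithm S; the sketch below shows that technique and may well differ from what was originally posted. The names selection_sample and demo_selection_sample are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <random>

// Hypothetical helper: a single pass over [first, last) that keeps each
// element with probability (still needed) / (still remaining).
// Every k-element subset is equally likely. Note that the output keeps the
// input's relative order, so it is a random subset, not a random ordering.
template <typename ForwardIt, typename OutputIt, typename URBG>
OutputIt selection_sample(ForwardIt first, ForwardIt last, std::size_t k,
                          OutputIt out, URBG& rng) {
    auto remaining = static_cast<std::size_t>(std::distance(first, last));
    assert(k <= remaining);
    for (; k > 0 && first != last; ++first, --remaining) {
        std::uniform_int_distribution<std::size_t> dist(0, remaining - 1);
        if (dist(rng) < k) {  // true with probability k / remaining
            *out++ = *first;
            --k;
        }
    }
    return out;
}

// Example use directly on a std::list, as in the question.
inline std::list<int> demo_selection_sample(const std::list<int>& data,
                                            std::size_t k, std::mt19937& rng) {
    std::list<int> sub;
    selection_sample(data.begin(), data.end(), k, std::back_inserter(sub), rng);
    return sub;
}
```

This is O(n) with no erasing and no copying of the source list. If the sublist must also be in random order, the k selected elements can be shuffled afterwards.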
I would use random_shuffle. You can change the generator by supplying a third parameter. It requires random access iterators, so you can either switch to a std::vector (which is generally far superior and preferred over std::list, arguably the worse container), or just operate on some array. I'll demonstrate both:
Now everything is in random order, just treat the first k elements as your subset:
Note that in another question, Jerry shares an excellent way of doing what you want.
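The two demonstrations this answer refers to did not survive in this copy. The sketch below illustrates the same shuffle-then-take-the-first-k idea for a std::vector and for a plain array. It is an assumption-laden stand-in rather than the original code: it uses std::shuffle (std::random_shuffle was removed in C++17), and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// Shuffle a std::vector in place and return copies of its first k elements.
template <typename URBG>
std::vector<int> take_from_vector(std::vector<int>& data, std::size_t k,
                                  URBG& rng) {
    assert(k <= data.size());
    std::shuffle(data.begin(), data.end(), rng);
    return std::vector<int>(data.begin(),
                            data.begin() + static_cast<std::ptrdiff_t>(k));
}

// Same idea on a plain array: raw pointers are random access iterators.
template <std::size_t N, typename URBG>
std::vector<int> take_from_array(int (&data)[N], std::size_t k, URBG& rng) {
    assert(k <= N);
    std::shuffle(std::begin(data), std::end(data), rng);
    return std::vector<int>(data, data + k);
}
```

In both cases the subset is simply the leading k elements of the shuffled storage, so copying them out is optional if reading them in place is enough.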
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm
Look under Examples > Modern method.
You don't need to shuffle your entire list. Stopping after the first k swaps makes it O(k), which is better than O(n).
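A minimal sketch of such a partial Fisher-Yates shuffle follows. It assumes the data sits in a random access container (a std::vector here, not a std::list) and runs front to back, a variant of the linked algorithm; the name partial_fisher_yates is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: performs only the first k steps of a Fisher-Yates
// shuffle. Afterwards data[0 .. k) holds a uniformly random selection of k
// elements in random order; the rest of the vector is left partly shuffled.
template <typename T, typename URBG>
void partial_fisher_yates(std::vector<T>& data, std::size_t k, URBG& rng) {
    assert(k <= data.size());
    for (std::size_t i = 0; i < k; ++i) {
        // Choose uniformly among the not-yet-fixed positions i .. n-1.
        std::uniform_int_distribution<std::size_t> pick(i, data.size() - 1);
        std::swap(data[i], data[pick(rng)]);
    }
}
```

Only k random numbers and k swaps are needed, so the cost does not depend on n once the data is in a vector. Copying a std::list into a vector first is still O(n).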