Efficient random sampling from a huge list
I have a data file with a large number of values (53,000,000+) and I would like to pull out a random subset of n of these values (say, 2,000,000). I implemented a Perl script that pulls the list into memory, uses the Fisher-Yates method to shuffle the array, and then prints out the first n values in the shuffled list. However, this shuffling process is taking a lot of time, even on much smaller test sets (50,000 values).
I'm looking for a more efficient, scalable way to identify a random subset of a huge set of values and print it out. Any suggestions?
Update: Based on the answers and some more searching, it looks like the correct terminology is "random sampling".
4 Answers
Elaborating on aix's answer above, to choose k out of a stream of items, read the items one at a time. Keep the first k items in a set S.

Now when reading the m-th item I (m > k now), keep it with probability k/m. If you do keep it, select an item U uniformly at random from S, and replace U with I.

The proof that this yields all subsets of size k with equal probability is based on induction on m. Note that you don't need to know n (the total number of items) in advance, and that S at each step is suitable. The algorithm is "streaming" - it doesn't require storing all items, or making a second pass.
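To make that concrete, here is a minimal Perl sketch of the reservoir approach described above, reading one value per line from standard input (the sample size of 2,000,000 is just taken from the question; adjust as needed):

    use strict;
    use warnings;

    my $k = 2_000_000;    # desired sample size (example value from the question)
    my @S;                # the reservoir: holds at most $k items
    my $m = 0;            # number of items read so far

    while (my $item = <STDIN>) {
        chomp $item;
        $m++;
        if (@S < $k) {
            push @S, $item;               # keep the first k items unconditionally
        } else {
            my $j = int(rand($m));        # uniform integer in 0 .. m-1
            $S[$j] = $item if $j < $k;    # i.e. keep the m-th item with probability k/m,
                                          # replacing a uniformly chosen element of S
        }
    }

    print "$_\n" for @S;

Memory use is proportional to k, not to the 53 million values in the file.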
First, check your implementation of the shuffle. If implemented correctly that should give you linear time. Also, modify the algorithm to stop after the desired number of elements have been shuffled: there's no need (practically and theoretically) to shuffle more numbers than you actually output.
If you ask for k numbers this will then cost you k elemental operations. I doubt you can do a lot better than that.
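For illustration, a partial Fisher-Yates shuffle that stops after k positions could look like this in Perl (a sketch only; like the original script, it still assumes the whole list fits in memory):

    use strict;
    use warnings;

    # Shuffle only the first $k positions, then stop; those positions then
    # hold a uniform random sample of the array.
    sub partial_shuffle_sample {
        my ($k, $values) = @_;                    # $values is an array reference
        my $n = scalar @$values;
        for my $i (0 .. $k - 1) {
            my $j = $i + int(rand($n - $i));      # random index from the not-yet-fixed tail
            @$values[$i, $j] = @$values[$j, $i];  # swap
        }
        return @$values[0 .. $k - 1];
    }

    my @sample = partial_shuffle_sample(5, [1 .. 20]);
    print "@sample\n";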
Don't shuffle, it's unnecessarily expensive.
There's a simple linear algorithm discussed in Jon Bentley's "Programming Pearls" (which Bentley says he learnt from Knuth's "Seminumerical Algorithms"). Use this method instead.
There are some Perl implementations about.
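For reference, the linear method Bentley discusses is, as far as I recall, selection sampling (Knuth's Algorithm S), which needs the total count up front. Here is a minimal sketch of that idea (not one of the linked implementations), using the counts from the question purely as example values:

    use strict;
    use warnings;

    # Selection sampling: scan the records once, keeping each one with
    # probability (records still needed) / (records still remaining).
    # Assumes the input really does contain exactly $n records.
    my $n = 53_000_000;    # total number of records (example value)
    my $k = 2_000_000;     # sample size wanted (example value)

    my ($need, $left) = ($k, $n);
    while (my $record = <STDIN>) {
        last if $need == 0;
        if (rand() < $need / $left) {
            print $record;    # selected; output preserves the original order
            $need--;
        }
        $left--;
    }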
Reading and shuffling the array would involve a lot of unnecessary data movement.
Here are a few ideas:
One: When you say you need a random subset, what exactly do you mean by "random" in this context? By which I mean, are the records in any particular order, or is the order relevant to whatever it is you are trying to randomize?
Because my first thought is that if the records are not in any relevant order, then you can get a random selection by simply calculating the total size divided by the sample size and then selecting every n-th record. For example, if you have 53 million records and you want a sample of 2 million, take 53 million / 2 million ~= 26, so read every 26th record.
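In Perl that systematic selection is only a few lines (using the question's counts as example values; note it yields roughly, not exactly, 2 million records):

    use strict;
    use warnings;

    my $total  = 53_000_000;               # total records (example value)
    my $sample = 2_000_000;                # desired sample size (example value)
    my $step   = int($total / $sample);    # ~26: take every 26th record

    while (my $record = <STDIN>) {
        print $record if $. % $step == 0;  # $. is the current input line number
    }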
Two: If that's not adequate, a more rigorous solution would be to generate 2 million random numbers in the range of zero to 53 million, ensuring no duplicates.
Two-A: If your sample size were small compared to the total number of records, like if you were just picking out a few hundred or a few thousand, I'd generate an array of however many entries, and for each entry, compare it to all previous entries to check for duplicates. If it's a duplicate, loop around and try again until you find a unique value.
Two-B: Assuming your numbers are not just examples but the actual values, then your sample size is large compared to the total population. In that case, given the ample memory on modern computers, you should be able to do this efficiently by creating an array of 53 million booleans initialized to false, each, of course, representing one record. Then run through a loop 2 million times. For each iteration, generate a random number from 0 to 53 million. Check the corresponding boolean in the array: If it's false, set it to true. If it's true, generate another random number and try again.
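A Perl sketch of the Two-B idea follows; Perl has no primitive boolean array, so an ordinary array of 0/1 flags stands in for the 53 million booleans described above (a packed bit string via vec() would be leaner if memory is tight):

    use strict;
    use warnings;

    my $total  = 53_000_000;    # total number of records (example value)
    my $wanted = 2_000_000;     # sample size (example value)

    my @chosen;                 # one flag per record index; undef means "not picked"
    my $picked = 0;
    while ($picked < $wanted) {
        my $i = int(rand($total));    # random record index 0 .. total-1
        next if $chosen[$i];          # duplicate: generate another number and retry
        $chosen[$i] = 1;
        $picked++;
    }

    # Then read the file once, printing only the flagged records.
    while (my $record = <STDIN>) {
        print $record if $chosen[$. - 1];
    }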
Three: Or wait, here's a better idea yet, given the relatively large percentage: Calculate the percentage of records you want to include. Then loop through a counter of all the records. For each, generate a random number from 0 to 1 and compare it to the desired percentage. If it's less, read that record and include it in the sample. If it's greater, skip the record.
If it's important to get the exact number of sample records, you can recalculate the percentage for each record. For example -- and to keep the example simple, let's pretend you want 10 out of 100 records:
You'd start with 10 / 100 = .1. So we generate a random number; say it comes up .04. Since .04 < .1, we include record #1.
Now we recalculate the percentage. We want 9 more records out of the 99 remaining, which gives 9/99 ~= .0909. Say our random number is .87. That's greater, so we skip record #2.
Recalculate again. We still need 9 records out of 98 remaining. So the magic number is 9/98, whatever that comes to. Etc.
Once we've got as many records as we want, the probability for future records will be zero, so we'll never go over. If we near the end and haven't picked up enough records, the probability will get very close to 100%. Like, if we still need 8 records and there are only 8 records left, the probability will be 8/8=100% so we'll be guaranteed to take the next record.
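A quick way to convince yourself of that last property is to run the recalculated-percentage scheme a few times on the 10-out-of-100 example and check that it always ends with exactly 10 records picked (a toy simulation, nothing more):

    use strict;
    use warnings;

    my ($want, $total) = (10, 100);    # the toy example from above
    for my $trial (1 .. 5) {
        my ($need, $left, $taken) = ($want, $total, 0);
        for my $rec (1 .. $total) {
            if (rand() < $need / $left) {    # probability = still needed / still remaining
                $taken++;
                $need--;
            }
            $left--;
        }
        print "trial $trial: picked $taken of $total\n";    # always prints 10
    }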