How do I randomly iterate through a large range?
I would like to randomly iterate through a range. Each value will be visited only once and all values will eventually be visited. For example:
class Array
  def shuffle
    ret = dup
    j = length
    i = 0
    while j > 1
      r = i + rand(j)
      ret[i], ret[r] = ret[r], ret[i]
      i += 1
      j -= 1
    end
    ret
  end
end
(0..9).to_a.shuffle.each{|x| f(x)}
where f(x) is some function that operates on each value. A Fisher-Yates shuffle is used to efficiently provide random ordering.

My problem is that shuffle needs to operate on an array, which is not cool because I am working with astronomically large numbers. Ruby will quickly consume a large amount of RAM trying to create a monstrous array. Imagine replacing (0..9) with (0..99**99). This is also why the following code will not work:
tried = {} # store previous attempts
bigint = 99**99
bigint.times {
  x = rand(bigint)
  redo if tried[x]
  tried[x] = true
  f(x) # some function
}
This code is very naive and quickly runs out of memory as tried obtains more entries.
What sort of algorithm can accomplish what I am trying to do?
[Edit1]: Why do I want to do this? I'm trying to exhaust the search space of a hash algorithm for an N-length input string looking for partial collisions. Each number I generate is equivalent to a unique input string, entropy and all. Basically, I'm "counting" using a custom alphabet.

[Edit2]: This means that f(x) in the above examples is a method that generates a hash and compares it to a constant, target hash for partial collisions. I do not need to store the value of x after I call f(x), so memory should remain constant over time.
[Edit3/4/5/6]: Further clarification/fixes.
[Solution]: The following code is based on @bta's solution. For the sake of conciseness, next_prime is not shown. It produces acceptable randomness and only visits each number once. See the actual post for more details.
N = size_of_range
Q = ( 2 * N / (1 + Math.sqrt(5)) ).to_i.next_prime
START = rand(N)
x = START
nil until f( x = (x + Q) % N ) == START # assuming f(x) returns x
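(next_prime is not part of core Ruby; one way to supply it, an illustrative stand-in rather than the code from the post, is a probabilistic primality walk via OpenSSL, which stays fast even for very large integers:)

require 'openssl'

class Integer
  # Illustrative next_prime, not the version from the original post: returns the
  # smallest probable prime strictly greater than self, using OpenSSL's
  # Miller-Rabin test so it also works for huge integers.
  def next_prime
    candidate = self < 2 ? 2 : self + 1
    candidate += 1 until OpenSSL::BN.new(candidate.to_s).prime?
    candidate
  end
end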
Comments (11)
I just remembered a similar problem from a class I took years ago; that is, iterating (relatively) randomly through a set (completely exhausting it) given extremely tight memory constraints. If I'm remembering this correctly, our solution algorithm was something like this:
1. Pick some number N (the size of the range).
2. Pick a random starting point x[0] inside N.
3. Pick an iterator Q less than N.
4. Generate successive points x[n] by adding Q to the previous point and wrapping around if needed. That is, x[n+1] = (x[n] + Q) % N
The trick is to find an iterator that will let you traverse the entire range without generating the same value twice. If I'm remembering correctly, any relatively prime N and Q will work (the closer the number to the bounds of the range, the less 'random' the input). In that case, a prime number that is not a factor of N should work. You can also swap bytes/nibbles in the resulting number to change the pattern with which the generated points "jump around" in N.

This algorithm only requires the starting point (x[0]), the current point (x[n]), the iterator value (Q), and the range limit (N) to be stored.

Perhaps someone else remembers this algorithm and can verify if I'm remembering it correctly?
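A quick way to sanity-check that claim in Ruby (illustrative values; any Q coprime to N behaves the same way):

n = 10
q = 7                                  # gcd(7, 10) == 1, so the walk is a full cycle
x = 3                                  # arbitrary starting point x[0]
visited = n.times.map { x = (x + q) % n }
puts visited.inspect                   # => [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]
puts visited.uniq.length == n          # => true, every value is hit exactly once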
As @Turtle answered, your problem doesn't have a solution. The @KandadaBoggu and @bta solutions give you random numbers in some ranges which are or are not random. You get clusters of numbers.

But I don't know why you care about a double occurrence of the same number. If (0..99**99) is your range, then even if you could generate 10^10 random numbers per second (say, a 3 GHz processor with about 4 cores generating one random number per CPU cycle, which is impossible, and Ruby will slow it down a lot further), it would take about 10^180 years to exhaust all the numbers. The probability that two identical numbers are generated during a whole year is also only about 10^-180. Our universe is probably about 10^9 years old, so if your computer could have started calculating when time began, you would have a probability of about 10^-170 of generating two identical numbers. In other words, it is practically impossible and you don't have to care about it. Even if you used Jaguar (number 1 on the www.top500.org supercomputer list) for only this one task, you would still need 10^174 years to get through all the numbers.
If you don't believe me, try this:
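(The snippet the answer points at isn't shown above; it was presumably something along these lines, a reconstruction rather than the original code:)

require 'set'

seen   = Set.new
bigint = 99**99

loop do
  x = rand(bigint)
  puts "Oh, no!" if seen.include?(x)   # a repeated value -- astronomically unlikely
  seen.add(x)
end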
I'll buy you a beer if you even once see "Oh, no!" on your screen during your lifetime :)
I could be wrong, but I don't think this is doable without storing some state. At the very least, you're going to need some state.
Even if you only use one bit per value (has this value been tried, yes or no), you will need X/8 bytes of memory to store the result (where X is the largest number). Assuming that you have 2GB of free memory, this would only get you about 16 billion numbers.
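As a back-of-the-envelope check of that figure:

bits_in_2gb = 2 * 1024**3 * 8      # => 17_179_869_184
puts bits_in_2gb                   # about 1.7 * 10^10 trackable values at one bit each, nowhere near 99**99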
Break the range into manageable batches as shown below:
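(The original snippet isn't shown above; a minimal sketch of the batching idea, with an assumed batch size, might look like this:)

BATCH_SIZE = 1_000_000                    # assumed; tune to available memory
MAX        = 99**99

(0..MAX).step(BATCH_SIZE) do |start|
  stop = [start + BATCH_SIZE - 1, MAX].min
  (start..stop).to_a.shuffle.each do |x|  # only one batch lives in memory at a time
    f(x)                                  # f(x) as defined in the question
  end
end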
You can further randomize the solution by randomly choosing the order in which the batches are processed.

PS: This is a good problem for map-reduce. Each batch can be worked on by independent nodes.
Reference:
Map-reduce in Ruby
You can randomly iterate over an array with the shuffle method.
You want what's called a "full cycle iterator"...
Here is pseudocode for the simplest version, which is perfect for most uses...
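(The pseudocode isn't shown above; a reconstruction in Ruby, with an assumed method name, might look like this:)

# Reconstruction, not the original pseudocode: adding a fixed stride that shares
# no factor with sample_size visits every value exactly once before repeating.
def full_cycle(sample_size, prime_number, random_seed)
  sample_size.times do |i|
    yield (random_seed + i * prime_number) % sample_size
  end
end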
If you call this like so:
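(again a reconstruction, using the sketch above:)

full_cycle(10, 13, 7) { |x| puts x }   # prints 7, 0, 3, 6, 9, 2, 5, 8, 1, 4 -- each of 0..9 once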
It would generate random numbers, looping through all 10 and never repeating. If you change random_seed, which can be anything, or prime_number, which must be greater than sample_size and not evenly divisible by it, you will get a new random order, but you will still never get a duplicate.
Database systems and other large-scale systems do this by writing the intermediate results of recursive sorts to a temp database file. That way, they can sort massive numbers of records while only keeping limited numbers of records in memory at any one time. This tends to be complicated in practice.
How "random" does your order have to be? If you don't need a specific input distribution, you could try a recursive scheme like this to minimize memory usage:
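(The code isn't shown above; a sketch of the digit-at-a-time recursion the answer describes, written for a three-digit space and with an assumed method name, could be:)

# Sketch for a 0..999 space: shuffle each decimal digit independently and nest
# the loops, so at most 10 * (number of digits) values are held in memory.
def random_indices
  (0..9).to_a.shuffle.each do |a|        # hundreds digit
    (0..9).to_a.shuffle.each do |b|      # tens digit
      (0..9).to_a.shuffle.each do |c|    # ones digit
        yield a * 100 + b * 10 + c
      end
    end
  end
end

random_indices { |i| f(i) }              # f as defined in the question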
Essentially, you are constructing the index by randomly generating one digit at a time. In the worst-case scenario, this will require enough memory to store 10 * (number of digits). You will encounter every number in the range (0..(10**3)) exactly once, but the order is only pseudo-random. That is, if the first loop sets a=1, then you will encounter all three-digit numbers of the form 1xx before you see the hundreds digit change.

The other downside is the need to manually construct the function to a specified depth. In your (0..(99**99)) case, this would likely be a problem (although I suppose you could write a script to generate the code for you). I'm sure there's probably a way to re-write this in a state-ful, recursive manner, but I can't think of it off the top of my head (ideas, anyone?).
[Edit]: Taking into account @klew and @Turtle's answers, the best I can hope for is batches of random (or close to random) numbers.
This is a recursive implementation of something similar to KandadaBoggu's solution. Basically, the search space (as a range) is partitioned into an array containing N equal-sized ranges. Each range is fed back in a random order as a new search space. This continues until the size of the range hits a lower bound. At this point the range is small enough to be converted into an array, shuffled, and checked.
Even though it is recursive, I haven't blown the stack yet. Instead, it errors out when attempting to partition a search space larger than about 10^19 keys. It has to do with the numbers being too large to convert to a long. It can probably be fixed:
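(The implementation itself lives in the pastebin linked below; what follows is a hypothetical sketch of the partition-and-recurse idea with assumed names and parameters, not the author's code:)

# Split the search space into equal sub-ranges, recurse into them in random
# order, and only shuffle a real array once a sub-range is small enough.
def random_search(range, branch = 16, leaf_size = 1_000_000, &check)
  size = range.last - range.first + 1
  if size <= leaf_size
    range.to_a.shuffle.each(&check)      # small enough: shuffle and test directly
  else
    step = size / branch
    subranges = (0...branch).map do |i|
      lo = range.first + i * step
      hi = i == branch - 1 ? range.last : lo + step - 1
      lo..hi
    end
    subranges.shuffle.each { |sub| random_search(sub, branch, leaf_size, &check) }
  end
end

random_search(0..99**99) { |x| f(x) }    # f(x) as defined in the question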
I hope the code comments help shed some light on my original question.
pastebin: full source

Note: PW_LEN under # options can be changed to a lower number in order to get quicker results.
For a prohibitively large space (like the one in the question), you can add a method along these lines to Range and then iterate the whole space with a good amount of randomness, so long as your space is a few orders of magnitude smaller than M127. Credit to @nick-steele and @bta for the approach.
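(The method and its call aren't shown above; a hypothetical sketch of the idea, combining the fixed-stride walk with the Mersenne prime M127 = 2**127 - 1 and an assumed method name, might be:)

class Range
  M127 = 2**127 - 1

  # Illustrative sketch, not the original answer's code: derive a stride from
  # M127, nudge it until it is coprime to the range size, then walk the range.
  # Assumes an inclusive integer range.
  def each_random
    n = last - first + 1
    q = M127 % n
    q += 1 until q.gcd(n) == 1          # coprime stride => full cycle, no repeats
    x = rand(n)                         # random starting point
    n.times do
      yield first + x
      x = (x + q) % n
    end
  end
end

(0..2**64).each_random { |x| f(x) }     # f(x) as defined in the question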
This isn't really a Ruby-specific answer but I hope it's permitted. Andrew Kensler gives a C++ "permute()" function that does exactly this in his "Correlated Multi-Jittered Sampling" report.
As I understand it, the exact function he provides really only works if your "array" is up to size 2^27, but the general idea could be used for arrays of any size.
I'll do my best to sort of explain it. The first part is you need a hash that is reversible "for any power-of-two sized domain". Consider x = i + 1. No matter what x is, even if your integer overflows, you can determine what i was. More specifically, you can always determine the bottom n bits of i from the bottom n bits of x. Addition is a reversible hash operation, as is multiplication by an odd number, as is doing a bitwise xor by a constant. If you know a specific power-of-two domain, you can scramble bits in that domain. E.g. x ^= (x & 0xFF) >> 5 is valid for the 16-bit domain. You can specify that domain with a mask, e.g. mask = 0xFF, and your hash function becomes x = hash(i, mask). Of course you can add a "seed" value into that hash function to get different randomizations. Kensler lays out more valid operations in the paper.

So you have a reversible function x = hash(i, mask, seed). The problem is that if you hash your index, you might end up with a value that is larger than your array size, i.e. your "domain". You can't just modulo this or you'll get collisions.

The reversible hash is the key to using a technique called "cycle walking", introduced in "Ciphers with Arbitrary Finite Domains". Because the hash is reversible (i.e. 1-to-1), you can just repeatedly apply the same hash until your hashed value is smaller than your array! Because you're applying the same hash, and the mapping is one-to-one, whatever value you end up on will map back to exactly one index, so you don't have collisions. So your function just keeps re-hashing the index until the result falls inside the domain (see the sketch after the next paragraph).
It could take a lot of hashes to get to your domain, so Kensler does a simple trick: he keeps the hash within the domain of the next power of two, which makes it require very few iterations (~2 on average), by masking out the unnecessary bits. The final algorithm looks like this:
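(The pseudocode blocks aren't shown above. A Ruby sketch of the masked cycle-walking idea, using a toy reversible hash rather than Kensler's actual mixing function, might look like this:)

# Toy reversible hash on [0, mask] where mask = 2^k - 1. Each step (xor with a
# constant, multiply by an odd constant, xor-shift, add a constant) is 1-to-1
# modulo 2^k, so the whole thing is a bijection. Kensler's paper gives a much
# better mix; this one is only for illustration.
def reversible_hash(x, mask, seed)
  x ^= seed & mask
  x = (x * 0x2545F491) & mask
  x ^= x >> 5
  (x + seed) & mask
end

# Permute an index within [0, size): hash inside the next power-of-two domain
# and "cycle walk" (re-hash) until the value lands below size (~2 tries on average).
def permute(index, size, seed)
  mask = 1
  mask = (mask << 1) | 1 while mask < size - 1   # smallest 2^k - 1 covering size - 1
  x = index
  loop do
    x = reversible_hash(x, mask, seed)
    return x if x < size
  end
end

size = 100
perm = (0...size).map { |i| permute(i, size, 42) }
puts(perm.sort == (0...size).to_a)               # => true: a collision-free shuffle of 0..99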
And that's it! Obviously the important thing here is choosing a good hash function, which Kensler provides in the paper but I wanted to break down the explanation. If you want to have different random permutations each time, you can add a "seed" value to the permute function which then gets passed to the hash function.