生成一个不在 40 亿给定整数之中的整数
我收到了这样一个面试问题:
给定一个包含 40 亿个整数的输入文件,提供一种算法来生成文件中未包含的整数。假设您有 1 GB 内存。跟进如果您只有 10 MB 内存您会做什么。
我的分析:
文件大小为4×109×4字节=16GB。
我们可以进行外部排序,从而让我们知道整数的范围。
我的问题是检测排序大整数集中丢失的整数的最佳方法是什么?
我的理解(阅读所有答案后):
假设我们谈论的是 32 位整数,则有 232 = 4*109 个不同的整数。
情况 1:我们有 1 GB = 1 * 109 * 8 位 = 80 亿位内存。
解决方案:
如果我们用一位代表一个不同的整数,就足够了。我们不需要排序。
实现:
int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
Scanner in = new Scanner(new FileReader("a.txt"));
while(in.hasNextInt()){
int n = in.nextInt();
bitfield[n/radix] |= (1 << (n%radix));
}
for(int i = 0; i< bitfield.lenght; i++){
for(int j =0; j<radix; j++){
if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
}
}
}
情况 2:10MB 内存 = 10 * 106 * 8 位 = 8000 万位
解决方案:
对于所有可能的 16 位前缀,有 216 个 整数 = 65536,我们需要 216 * 4 * 8 = 200 万位。我们需要构建 65536 个存储桶。对于每个桶,我们需要 4 个字节来保存所有可能性,因为最坏的情况是所有 40 亿个整数都属于同一个桶。
- 通过第一次遍历文件构建每个存储桶的计数器。
- 扫描桶,找到第一个命中数小于 65536 的桶。
- 构建新的存储桶,其高 16 位前缀是我们在步骤 2 中找到的 通过文件的第二遍
- 扫描步骤3中构建的桶,找到第一个没有的桶 取得成功。
该代码与上面的代码非常相似。
结论: 我们通过增加文件传递来减少内存。
对那些迟到的人的澄清:所提出的问题并没有说文件中不包含确切的一个整数 - 至少大多数人不是这样解释它的。不过,评论线程中的许多评论都是关于任务的这种变化。不幸的是,将其引入评论线程的评论后来被其作者删除,所以现在看来,对它的孤立回复只是误解了一切。这很令人困惑,抱歉。
I have been given this interview question:
Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.
My analysis:
The size of the file is 4×109×4 bytes = 16 GB.
We can do external sorting, thus letting us know the range of the integers.
My question is what is the best way to detect the missing integer in the sorted big integer sets?
My understanding (after reading all the answers):
Assuming we are talking about 32-bit integers, there are 232 = 4*109 distinct integers.
Case 1: we have 1 GB = 1 * 109 * 8 bits = 8 billion bits memory.
Solution:
If we use one bit representing one distinct integer, it is enough. we don't need sort.
Implementation:
int radix = 8;
byte[] bitfield = new byte[0xffffffff/radix];
void F() throws FileNotFoundException{
Scanner in = new Scanner(new FileReader("a.txt"));
while(in.hasNextInt()){
int n = in.nextInt();
bitfield[n/radix] |= (1 << (n%radix));
}
for(int i = 0; i< bitfield.lenght; i++){
for(int j =0; j<radix; j++){
if( (bitfield[i] & (1<<j)) == 0) System.out.print(i*radix+j);
}
}
}
Case 2: 10 MB memory = 10 * 106 * 8 bits = 80 million bits
Solution:
For all possible 16-bit prefixes, there are 216 number of
integers = 65536, we need 216 * 4 * 8 = 2 million bits. We need build 65536 buckets. For each bucket, we need 4 bytes holding all possibilities because the worst case is all the 4 billion integers belong to the same bucket.
- Build the counter of each bucket through the first pass through the file.
- Scan the buckets, find the first one who has less than 65536 hit.
- Build new buckets whose high 16-bit prefixes are we found in step2
through second pass of the file- Scan the buckets built in step3, find the first bucket which doesnt
have a hit.The code is very similar to above one.
Conclusion:
We decrease memory through increasing file pass.
A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file—at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(30)
假设“整数”表示 32 位:对于所有可能的 16 位,10 MB 的空间足以让您计算输入文件中具有任何给定 16 位前缀的数字有多少个位前缀一次通过输入文件。至少其中一个桶的撞击次数少于 216 次。进行第二遍以查找该存储桶中哪些可能的数字已被使用。
如果它意味着超过 32 位,但仍具有有限大小:按照上述操作,忽略所有恰好落在(有符号或无符号;您的选择)32 位范围之外的输入数字。
如果“整数”表示数学整数:通读一次输入并跟踪您见过的最长数字的
最大数字长度。完成后,输出最大加一一个多一位的随机数。 (文件中的一个数字可能是一个大数,需要超过 10 MB 才能准确表示,但如果输入是一个文件,那么您至少可以表示适合的任何内容的长度它)。Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count how many numbers there are in the input file with any given 16-bit prefix, for all possible 16-bit prefixes in one pass through the input file. At least one of the buckets will have be hit less than 216 times. Do a second pass to find of which of the possible numbers in that bucket are used already.
If it means more than 32 bits, but still of bounded size: Do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.
If "integer" means mathematical integer: Read through the input once and keep track of the
largest numberlength of the longest number you've ever seen. When you're done, outputthe maximum plus onea random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it).统计信息算法使用比确定性方法更少的遍数来解决这个问题。
如果允许非常大的整数,那么就可以在 O(1) 时间内生成一个可能唯一的数字。像 GUID 这样的伪随机 128 位整数只会与现有的 40 亿个中的一个发生冲突集合中的整数不到每 640 亿个案例中的一个。
如果整数限制为 32 位,则可以使用远小于 10 MB 的空间在一次传递中生成一个可能唯一的数字。伪随机 32 位整数与 40 亿个现有整数之一发生冲突的几率约为 93% (4e9 / 2^32)。 1000 个伪随机整数全部发生碰撞的几率小于 120,000 亿亿分之一(一次碰撞的几率 ^ 1000)。因此,如果一个程序维护一个包含 1000 个伪随机候选的数据结构,并迭代已知的整数,消除候选中的匹配项,那么几乎肯定会找到至少一个不在文件中的整数。
Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.
If very large integers are allowed then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer like a GUID will only collide with one of the existing four billion integers in the set in less than one out of every 64 billion billion billion cases.
If integers are limited to 32 bits then one can generate a number that is likely to be unique in a single pass using much less than 10 MB. The odds that a pseudo-random 32-bit integer will collide with one of the 4 billion existing integers is about 93% (4e9 / 2^32). The odds that 1000 pseudo-random integers will all collide is less than one in 12,000 billion billion billion (odds-of-one-collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
Jon Bentley“第 1 栏:破解牡蛎”中对此问题进行了详细讨论< em>编程珍珠 Addison-Wesley pp.3-10
Bentley讨论了几种方法,包括外部排序、使用多个外部文件的合并排序等,但最好的方法Bentley 建议使用位字段的单遍算法,他幽默地将其称为“Wonder Sort”:)
谈到这个问题,40 亿个数字可以用以下形式表示:
实现位集的代码很简单:(取自 解决方案页面 )
Bentley 的算法对文件进行单次传递,
设置
数组中的适当位,然后使用上面的test
宏可以找到丢失的号码。如果可用内存小于 0.466 GB,Bentley 建议使用 k-pass 算法,该算法根据可用内存将输入划分为多个范围。举一个非常简单的例子,如果只有1个字节(即处理8个数字的内存)可用,并且范围是从0到31,我们将其分为0到7、8-15、16-22等范围并在每次
32/8 = 4
遍中处理这个范围。HTH。
A detailed discussion on this problem has been discussed in Jon Bentley "Column 1. Cracking the Oyster" Programming Pearls Addison-Wesley pp.3-10
Bentley discusses several approaches, including external sort, Merge Sort using several external files etc., But the best method Bentley suggests is a single pass algorithm using bit fields, which he humorously calls "Wonder Sort" :)
Coming to the problem, 4 billion numbers can be represented in :
The code to implement the bitset is simple: (taken from solutions page )
Bentley's algorithm makes a single pass over the file,
set
ting the appropriate bit in the array and then examines this array usingtest
macro above to find the missing number.If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on available memory. To take a very simple example, if only 1 byte (i.e memory to handle 8 numbers ) was available and the range was from 0 to 31, we divide this into ranges of 0 to 7, 8-15, 16-22 and so on and handle this range in each of
32/8 = 4
passes.HTH.
由于问题没有指定我们必须找到文件中没有的最小可能数字,因此我们可以生成一个比输入文件本身长的数字。 :)
Since the problem does not specify that we have to find the smallest possible number that is not in the file we could just generate a number that is longer than the input file itself. :)
对于 1 GB RAM 变体,您可以使用位向量。您需要分配 40 亿位 == 500 MB 字节数组。对于从输入中读取的每个数字,将相应的位设置为“1”。完成后,迭代这些位,找到第一个仍为“0”的位。它的索引就是答案。
For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you done, iterate over the bits, find the first one that is still '0'. Its index is the answer.
如果它们是 32 位整数(可能是从接近 232 的约 40 亿个数字中选择的),则您的 40 亿个数字列表将最多占据可能整数的 93% (4 * 109 / (232) )。因此,如果您创建一个 232 位的位数组,每个位都初始化为零(这将占用 229 字节 ~ 500 MB RAM;记住一个字节 = 23 位 = 8 位),读取整数列表,并为每个 int 设置相应的位数组元素(从 0 到 1);然后读取您的位数组并返回第一个仍为 0 的位
。如果您的 RAM 较少(~10 MB),则需要稍微修改此解决方案。 10 MB ~ 83886080 位仍然足以为 0 到 83886079 之间的所有数字创建位数组。因此您可以通读整数列表;并且仅在位数组中记录 0 到 83886079 之间的 #。如果数字是随机分布的;具有压倒性的概率(大约相差 100% 10-2592069)你会发现一个丢失的int)。事实上,如果您只选择数字 1 到 2048(只有 256 字节的 RAM),您仍然会发现丢失的数字占绝大多数(99.99999999999999999999999999999999999999999999999999999999999995%)。
但我们假设数字不是大约 40 亿;您有类似 232 - 1 个数字和小于 10 MB 的 RAM;所以任何小范围的整数只有很小的可能性不包含该数字。
如果保证列表中的每个 int 都是唯一的,则可以将数字相加,然后将缺少一个 # 的总和减去总和 (½)(232)(232 - 1) = 9223372034707292160 找到丢失的 int。但是,如果 int 出现两次,则此方法将失败。
然而,你总是可以分而治之。一种简单的方法是读取数组并计算前半部分(0 到 231-1)和后半部分(231-1)的数字数量。 >, 232)。然后选择数字较少的范围,并重复将该范围一分为二。 (假设 (231, 232) 中少了两个数字,那么您的下一次搜索将计算范围 (231) 内的数字>, 3*230-1), (3*230, 232),直到找到零的范围。应该需要 O(lg N) ~ 32 次读取数组。
我们在每个步骤中只使用两个整数(或者大约 8 字节的 RAM 和 4 字节(32 位)。 ) 更好的方法是分成 sqrt(232) = 216 = 65536 个 bin,每个 bin 中有 65536 个数字。每个 bin 需要 4 个字节来存储其计数,因此您需要 218 字节 = 256 kB,因此 bin 0 为 (0 到 65535=216-1)。 ,bin 1 为 (216=65536 到 2*216-1=131071),bin 2是 (2*216=131072 到 3*216-1=196607)
。 ; 并计算 216 每个 bin 中有多少个整数,并找到不包含所有 65536 个数字的 incomplete_bin。然后你再读一遍这个40亿的整数列表;但这次只注意整数何时在该范围内;当你找到它们时稍微翻转一下。
If they are 32-bit integers (likely from the choice of ~4 billion numbers close to 232), your list of 4 billion numbers will take up at most 93% of the possible integers (4 * 109 / (232) ). So if you create a bit-array of 232 bits with each bit initialized to zero (which will take up 229 bytes ~ 500 MB of RAM; remember a byte = 23 bits = 8 bits), read through your integer list and for each int set the corresponding bit-array element from 0 to 1; and then read through your bit-array and return the first bit that's still 0.
In the case where you have less RAM (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit-array for all numbers between 0 and 83886079. So you could read through your list of ints; and only record #s that are between 0 and 83886079 in your bit array. If the numbers are randomly distributed; with overwhelming probability (it differs by 100% by about 10-2592069) you will find a missing int). In fact, if you only choose numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.
But let's say instead of having about 4 billion numbers; you had something like 232 - 1 numbers and less than 10 MB of RAM; so any small range of ints only has a small possibility of not containing the number.
If you were guaranteed that each int in the list was unique, you could sum the numbers and subtract the sum with one # missing to the full sum (½)(232)(232 - 1) = 9223372034707292160 to find the missing int. However, if an int occurred twice this method will fail.
However, you can always divide and conquer. A naive method, would be to read through the array and count the number of numbers that are in the first half (0 to 231-1) and second half (231, 232). Then pick the range with fewer numbers and repeat dividing that range in half. (Say if there were two less number in (231, 232) then your next search would count the numbers in the range (231, 3*230-1), (3*230, 232). Keep repeating until you find a range with zero numbers and you have your answer. Should take O(lg N) ~ 32 reads through the array.
That method was inefficient. We are only using two integers in each step (or about 8 bytes of RAM with a 4 byte (32-bit) integer). A better method would be to divide into sqrt(232) = 216 = 65536 bins, each with 65536 numbers in a bin. Each bin requires 4 bytes to store its count, so you need 218 bytes = 256 kB. So bin 0 is (0 to 65535=216-1), bin 1 is (216=65536 to 2*216-1=131071), bin 2 is (2*216=131072 to 3*216-1=196607). In python you'd have something like:
Read through the ~4 billion integer list; and count how many ints fall in each of the 216 bins and find an incomplete_bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again; but this time only notice when integers are in that range; flipping a bit when you find them.
为什么要把事情搞得这么复杂呢?您要求提供文件中不存在的整数吗?
根据指定的规则,您唯一需要存储的是迄今为止在文件中遇到的最大整数。读取整个文件后,返回比该值大 1 的数字。
不存在命中 maxint 或其他任何问题的风险,因为根据规则,对于整数的大小或算法返回的数字没有限制。
Why make it so complicated? You ask for an integer not present in the file?
According to the rules specified, the only thing you need to store is the largest integer that you encountered so far in the file. Once the entire file has been read, return a number 1 greater than that.
There is no risk of hitting maxint or anything, because according to the rules, there is no restriction to the size of the integer or the number returned by the algorithm.
这可以使用二分搜索的变体在很小的空间内解决。
从允许的数字范围开始,
0
到4294967295
。计算中点。
循环遍历文件,计算有多少个数字等于、小于或高于中点值。
如果没有相同的数字,那么就完成了。中点数字就是答案。
否则,选择数字最少的范围,并使用此新范围从步骤 2 重复。
这将需要对文件进行最多 32 次线性扫描,但它只会使用几个字节的内存来存储范围和计数。
这本质上与 Henning 的解决方案,只不过它使用两个 bin 而不是 16k。
This can be solved in very little space using a variant of binary search.
Start off with the allowed range of numbers,
0
to4294967295
.Calculate the midpoint.
Loop through the file, counting how many numbers were equal, less than or higher than the midpoint value.
If no numbers were equal, you're done. The midpoint number is the answer.
Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.
This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
This is essentially the same as Henning's solution, except it uses two bins instead of 16k.
编辑 好吧,这并没有经过深思熟虑,因为它假设文件中的整数遵循某种静态分布。显然他们不需要这样做,但即便如此,人们也应该尝试一下这个:
有大约 43 亿个 32 位整数。我们不知道它们在文件中的分布情况,但最坏的情况是香农熵最高的情况:均匀分布。在这种情况下,文件中任何一个整数不出现的概率为
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
香农熵越低,平均概率就越高,但即使对于这种最坏的情况,我们有 90% 的机会找到 5 之后不出现的数字用随机整数进行猜测。只需使用伪随机生成器创建这样的数字,并将它们存储在列表中。然后依次读取 int 并将其与您的所有猜测进行比较。当存在匹配项时,删除此列表条目。浏览完所有文件后,您很可能会留下不止一种猜测。使用其中任何一个。在极少数情况下(即使在最坏的情况下也是 10%)无法猜测的情况下,获取一组新的随机整数,这次可能更多(10->99%)。
内存消耗:几十个字节,复杂度:O(n),开销:不可避免,因为大部分时间都花在不可避免的硬盘访问上,而不是比较整数。
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
EDIT Ok, this wasn't quite thought through as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:
There are ≈4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In this case, the probablity for any one integer to not occur in the file is
( (2³²-1)/2³² )⁴ ⁰⁰⁰ ⁰⁰⁰ ⁰⁰⁰ ≈ .4
The lower the Shannon entropy, the higher this probability gets on the average, but even for this worst case we have a chance of 90% to find a nonoccurring number after 5 guesses with random integers. Just create such numbers with a pseudorandom generator, store them in a list. Then read int after int and compare it to all of your guesses. When there's a match, remove this list entry. After having been through all of the file, chances are you will have more than one guess left. Use any of them. In the rare (10% even at worst case) event of no guess remaining, get a new set of random integers, perhaps more this time (10->99%).
Memory consumption: a few dozen bytes, complexity: O(n), overhead: neclectable as most of the time will be spent in the unavoidable hard disk accesses rather than comparing ints anyway.
The actual worst case, when we do not assume a static distribution, is that every integer occurs max. once, because then only
1 - 4000000000/2³² ≈ 6%
of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
如果 [0, 2^x - 1] 范围内缺少一个整数,则只需将它们全部异或即可。例如:(
我知道这并不能完全回答这个问题,但对于一个非常相似的问题来说,这是一个很好的答案。)
If you have one integer missing from the range [0, 2^x - 1] then just xor them all together. For example:
(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)
他们可能想看看您是否听说过概率布隆过滤器,它可以非常有效地绝对确定某个值是否不存在一个大集合的一部分,(但只能以高概率确定它是该集合的成员。)
They may be looking to see if you have heard of a probabilistic Bloom Filter which can very efficiently determine absolutely if a value is not part of a large set, (but can only determine with high probability it is a member of the set.)
根据原问题当前的措辞,最简单的解决方案是:
找到文件中的最大值,然后加1。
Based on the current wording in the original question, the simplest solution is:
Find the maximum value in the file, then add 1 to it.
使用
BitSet
。将 40 亿个整数(假设最多 2^32 个整数)按每字节 8 个打包到 BitSet 中,大小为 2^32 / 2^3 = 2^29 = 大约 0.5 Gb。要添加更多细节 - 每次读取数字时,请在 BitSet 中设置相应的位。然后,遍历 BitSet 以查找第一个不存在的数字。事实上,您可以通过重复选择随机数并测试它是否存在来同样有效地做到这一点。
实际上 BitSet.nextClearBit(0) 会告诉您第一个未设置的位。
查看 BitSet API,它似乎只支持 0..MAX_INT,因此您可能需要 2 个 BitSets - 一个用于 +'ve 数字,一个用于 -ve 数字 - 但内存要求不会改变。
Use a
BitSet
. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 = approx 0.5 Gb.To add a bit more detail - every time you read a number, set the corresponding bit in the BitSet. Then, do a pass over the BitSet to find the first number that's not present. In fact, you could do this just as effectively by repeatedly picking a random number and testing if it's present.
Actually BitSet.nextClearBit(0) will tell you the first non-set bit.
Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need 2 BitSets - one for +'ve numbers and one for -'ve numbers - but the memory requirements don't change.
如果没有大小限制,最快的方法是获取文件的长度,并生成文件的长度+1个随机数字(或只是“11111...”)。优点:您甚至不需要读取文件,并且可以将内存使用量降至几乎为零。缺点:您将打印数十亿个数字。
但是,如果唯一的因素是最大限度地减少内存使用,而其他都不重要,那么这将是最佳解决方案。它甚至可能会让你获得“最严重滥用规则”奖。
If there is no size limit, the quickest way is to take the length of the file, and generate the length of the file+1 number of random digits (or just "11111..." s). Advantage: you don't even need to read the file, and you can minimize memory use nearly to zero. Disadvantage: You will print billions of digits.
However, if the only factor was minimizing memory usage, and nothing else is important, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.
如果我们假设数字的范围始终为 2^n(2 的偶次幂),则异或将起作用(如另一张海报所示)。至于为什么,让我们来证明一下:
理论
给定任何基于 0 的整数范围,其中包含
2^n
元素且缺少一个元素,您可以通过简单地对已知值进行异或来找到缺少的元素一起得出缺失的数字。证明
让我们看一下 n = 2。对于 n=2,我们可以表示 4 个唯一的整数:0、1、2、3。它们的位模式为:
现在,如果我们观察的话,每一位都被设置了两次。因此,由于它被设置为偶数次,因此数字的异或将产生 0。如果缺少单个数字,则异或将产生一个数字,当与丢失的数字进行异或时将产生0。因此,缺失的数与异或所得的数完全相同。如果我们删除 2,则生成的异或将为 10(或 2)。
现在,让我们看看n+1。让我们分别调用
n
、x
中每个位被设置的次数以及n+1
中每个位被设置的次数y
。y
的值将等于y = x * 2
,因为有x
个元素的n+1
> 位设置为 0,x
元素的n+1
位设置为 1。由于2x
始终为偶数,n+1
将始终使每个位设置偶数次。因此,由于
n=2
有效,并且n+1
有效,因此异或方法将适用于n>=2
的所有值。基于 0 的范围的算法
这非常简单。它使用 2*n 位内存,因此对于任何范围 <= 32,2 个 32 位整数都可以工作(忽略文件描述符消耗的任何内存)。它对文件进行了一次传递。
任意范围的算法
该算法适用于任何起始数字到任何结束数字的范围,只要总范围等于 2^n...这基本上将范围重新设置为最小值为 0。但它确实需要两次遍历文件(第一次获取最小值,第二次计算丢失的整数)。
任意范围
我们可以将此修改后的方法应用于一组任意范围,因为所有范围都会至少跨越 2^n 的幂一次。仅当有一个缺失位时,此方法才有效。它需要对未排序的文件进行 2 次传递,但每次都会找到单个缺失的数字:
基本上,将范围重新设置为 0 附近。然后,它在计算异或时计算要附加的未排序值的数量。然后,它将未排序值的计数加 1 以处理缺失值(计算缺失值)。然后,继续对 n 值进行异或,每次增加 1,直到 n 是 2 的幂。然后将结果重新基回原始基数。完毕。
这是我在 PHP 中测试的算法(使用数组而不是文件,但概念相同):
将任意范围的值(我测试过包括负数)放入一个数组中,其中一个在该范围内缺失,它找到了正确的值每次。
另一种方法
既然我们可以使用外部排序,为什么不直接检查间隙呢?如果我们假设文件在运行该算法之前已排序:
If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:
The Theory
Given any 0 based range of integers that has
2^n
elements with one element missing, you can find that missing element by simply xor-ing the known values together to yield the missing number.The Proof
Let's look at n = 2. For n=2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:
Now, if we look, each and every bit is set exactly twice. Therefore, since it is set an even number of times, and exclusive-or of the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that when exclusive-ored with the missing number will result in 0. Therefore, the missing number, and the resulting exclusive-ored number are exactly the same. If we remove 2, the resulting xor will be
10
(or 2).Now, let's look at n+1. Let's call the number of times each bit is set in
n
,x
and the number of times each bit is set inn+1
y
. The value ofy
will be equal toy = x * 2
because there arex
elements with then+1
bit set to 0, andx
elements with then+1
bit set to 1. And since2x
will always be even,n+1
will always have each bit set an even number of times.Therefore, since
n=2
works, andn+1
works, the xor method will work for all values ofn>=2
.The Algorithm For 0 Based Ranges
This is quite simple. It uses 2*n bits of memory, so for any range <= 32, 2 32 bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.
The Algorithm For Arbitrary Based Ranges
This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n... This basically re-bases the range to have the minimum at 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).
Arbitrary Ranges
We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing bit. It takes 2 passes of an unsorted file, but it will find the single missing number every time:
Basically, re-bases the range around 0. Then, it counts the number of unsorted values to append as it computes the exclusive-or. Then, it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then, keep xoring the n value, incremented by 1 each time until n is a power of 2. The result is then re-based back to the original base. Done.
Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):
Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.
Another Approach
Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:
棘手的问题,除非引用不当。只需读取一次文件即可获取最大整数
n
,并返回n+1
。当然,您需要一个备份计划,以防
n+1
导致整数溢出。Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer
n
, and returnn+1
.Of course you'd need a backup plan in case
n+1
causes an integer overflow.检查输入文件的大小,然后输出任何数字,该数字太大而无法用该大小的文件表示。这似乎是一个廉价的技巧,但它是一个面试问题的创造性解决方案,它巧妙地回避了内存问题,而且从技术上讲它是 O(n) 的。
应打印 10 bitcount - 1,它始终大于 2 bitcount。从技术上讲,您必须击败的数字是 2 bitcount - (4 * 109 - 1),因为您知道有 (40 亿- 1) 文件中的其他整数,即使完美压缩,它们也将至少占用一位。
Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).
Should print 10 bitcount - 1, which will always be greater than 2 bitcount. Technically, the number you have to beat is 2 bitcount - (4 * 109 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.
最简单的方法是找到文件中的最小数字,并返回比该数字少的 1。对于包含 n 个数字的文件,这使用 O(1) 存储空间和 O(n) 时间。但是,如果数字范围有限,则会失败,这可能会使 min-1 不是数字。
使用位图的简单直接的方法已经提到过。该方法使用 O(n) 时间和存储。
还提到了具有 2^16 个计数桶的 2-pass 方法。它读取 2*n 个整数,因此使用 O(n) 时间和 O(1) 存储,但它无法处理超过 2^16 个数字的数据集。然而,通过运行 4 次而不是 2 次,它可以轻松扩展到(例如)2^60 64 位整数,并且通过仅使用内存中适合的 bin 并相应地增加传递数量,可以轻松适应使用微小内存。这种情况下运行时间不再是 O(n) 而是 O(n*log n)。
将所有数字异或在一起的方法,迄今为止由 rfrankel 提到,并由 ircmaxell 详细回答了 stackoverflow#35185,正如 ltn100 指出的那样。它使用 O(1) 存储和 O(n) 运行时间。如果目前我们假设 32 位整数,XOR 有 7% 的概率产生不同的数字。基本原理:给定〜4G不同的数字异或在一起,并且大约。 300M不在文件中,每个位位置设置的位数有相同的奇数或偶数机会。因此,2^32 个数字作为 XOR 结果出现的可能性相同,其中 93% 已在文件中。请注意,如果文件中的数字并非全部不同,则 XOR 方法的成功概率会上升。
The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if number range is limited, which could make min-1 not-a-number.
The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.
A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^16 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).
The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185, as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.
来自 Carbonetc 的 Reddit。
From Reddit by Carbonetc.
出于某种原因,我一读到这个问题就想到了对角化。我假设任意大的整数。
读出第一个数字。向左填充零位,直到达到 40 亿位。如果第一个(高位)位为0,则输出1;否则输出 0。(您实际上不必左填充:如果数字中没有足够的位,您只需输出 1。)对第二个数字执行相同的操作,除了使用其第二位。以这种方式继续浏览该文件。您将一次一位地输出 40 亿位数,并且该数字不会与文件中的任何数字相同。证明:它与第 n 个数字相同,那么他们会在第 n 位上达成一致,但通过构造却无法达成一致。
For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.
Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion bit number one bit at a time, and that number will not be the same as any in the file. Proof: it were the same as the nth number, then they would agree on the nth bit, but they don't by construction.
您可以使用位标志来标记整数是否存在。
遍历整个文件后,扫描每一位以确定该数字是否存在。
假设每个整数都是 32 位,如果完成了位标记,它们将方便地装入 1 GB RAM。
You can use bit flags to mark whether an integer is present or not.
After traversing the entire file, scan each bit to determine if the number exists or not.
Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.
为了完整起见,这是另一个非常简单的解决方案,它很可能需要很长时间才能运行,但使用的内存很少。
令所有可能的整数的范围为
int_min
到int_max
,并且bool isNotInFile(integer) 一个函数,如果文件不包含某个整数,则返回 true,否则返回 false(通过将该特定整数与文件中的每个整数进行比较)
Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.
Let all possible integers be the range from
int_min
toint_max
, andbool isNotInFile(integer)
a function which returns true if the file does not contain a certain integer and false else (by comparing that certain integer with each integer in the file)对于 10 MB 内存限制:
完成后,就走一条之前没有创建过的路径来创建请求的数字。
40 亿个数字 = 2^32,这意味着 10 MB 可能不够。
编辑
可以进行优化,如果两端叶子已创建并且具有共同的父级,则可以将它们删除并将父级标记为不是解决方案。这会减少分支并减少对内存的需求。
编辑 II
也不需要完全构建树。仅当数字相似时才需要构建深层分支。如果我们也砍掉树枝,那么这个解决方案实际上可能有效。
For the 10 MB memory constraint:
When finished, just take a path that has not been created before to create the requested number.
4 billion number = 2^32, meaning 10 MB might not be sufficient.
EDIT
An optimization is possible, if two ends leafs have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.
EDIT II
There is no need to build the tree completely too. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might work in fact.
我将回答1GB版本:
问题中没有足够的信息,所以我将首先陈述一些假设:
整数是32位,范围-2,147,483,648到2,147,483,647。
伪代码:
I will answer the 1 GB version:
There is not enough information in the question, so I will state some assumptions first:
The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.
Pseudo-code:
正如 Ryan 所说,基本上,对文件进行排序,然后遍历整数,当跳过一个值时,您就得到了它:)
编辑在反对者处:OP提到可以对文件进行排序,所以这是一个有效的方法。
As Ryan said it basically, sort the file and then go over the integers and when a value is skipped there you have it :)
EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.
只要我们在做创造性的答案,这里就有另一个答案。
使用外部排序程序对输入文件进行数字排序。这适用于您可能拥有的任何内存量(如果需要,它将使用文件存储)。
通读已排序的文件并输出第一个缺少的数字。
As long as we're doing creative answers, here is another one.
Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed).
Read through the sorted file and output the first number that is missing.
位消除
一种方法是消除位,但这实际上可能不会产生结果(很可能不会)。伪代码:
位计数
跟踪位计数;并使用数量最少的位来生成一个值。同样,这并不能保证生成正确的值。
范围逻辑
跟踪列表排序范围(按开始排序)。范围由结构定义:
遍历文件中的每个值并尝试将其从当前范围中删除。此方法没有内存保证,但效果应该不错。
Bit Elimination
One way is to eliminate bits, however this might not actually yield a result (chances are it won't). Psuedocode:
Bit Counts
Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.
Range Logic
Keep track of a list ordered ranges (ordered by start). A range is defined by the structure:
Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
2128*1018 + 1(即 (28)16*1018< /sup> + 1 ) - 这不能成为今天的通用答案吗?这表示 16 EB 文件无法容纳的数字,这是当前任何文件系统中的最大文件大小。
2128*1018 + 1 ( which is (28)16*1018 + 1 ) - cannot it be a universal answer for today? This represents a number that cannot be held in 16 EB file, which is the maximum file size in any current file system.
我认为这是一个已解决的问题(见上文),但有一个有趣的副作用需要记住,因为它可能会被问到:
如果正好有 4,294,967,295 (2^32 - 1) 32 位整数,没有重复,因此只缺少一个,有一个简单的解决方案。
从零开始运行总计,对于文件中的每个整数,添加具有 32 位溢出的整数(实际上,runningTotal = (runningTotal + nextInteger) % 4294967296)。完成后,将 4294967296/2 添加到运行总数中,再次出现 32 位溢出。用 4294967296 减去这个值,结果就是缺失的整数。
“仅缺少一个整数”问题只需运行一次即可解决,并且只有 64 位 RAM 专用于数据(32 位用于运行总计,32 位用于读取下一个整数)。
推论:如果我们不关心整数结果必须有多少位,则更通用的规范非常容易匹配。我们只是生成一个足够大的整数,它不能包含在我们给定的文件中。同样,这占用的 RAM 绝对是最小的。请参阅伪代码。
I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:
If there are exactly 4,294,967,295 (2^32 - 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.
Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.
The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.
如果您不假设 32 位约束,则只需返回随机生成的 64 位数字(如果您是悲观主义者,则返回 128 位)。发生碰撞的几率为
1 in 2^64/(4*10^9) = 4611686018.4
(大约为 40 亿分之一)。大多数时候你是对的!(开玩笑……有点。)
If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is
1 in 2^64/(4*10^9) = 4611686018.4
(roughly 1 in 4 billion). You'd be right most of the time!(Joking... kind of.)