Generate an integer that is not among four billion given ones

Posted 2024-11-30 18:20:52 · 1720 words · 2 views · 0 comments

I have been given this interview question:

Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB memory. Follow up with what you would do if you have only 10 MB of memory.

My analysis:

The size of the file is 4×10⁹×4 bytes = 16 GB.

We can do external sorting, thus letting us know the range of the integers.

My question is: what is the best way to detect a missing integer in a large sorted set of integers?

My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2³² ≈ 4.3*10⁹ distinct integers.

Case 1: we have 1 GB = 1 * 10⁹ * 8 bits = 8 billion bits memory.

Solution:

If we use one bit to represent each distinct integer, 2³² bits = 512 MiB is enough, so the bitmap fits in 1 GB and we don't need to sort.

Implementation:

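import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class MissingInt {
    // One bit per 32-bit value: 2^32 bits = 2^29 bytes = 512 MiB (needs a large enough JVM heap).
    // The original size, 0xffffffff/radix, evaluates to -1/8 = 0 in Java.
    static final byte[] bitfield = new byte[1 << 29];

    static void F() throws FileNotFoundException {
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            // Treat n as an unsigned bit pattern so negative values index correctly.
            bitfield[n >>> 3] |= (byte) (1 << (n & 7));
        }
        in.close();

        for (int i = 0; i < bitfield.length; i++) {
            for (int j = 0; j < 8; j++) {
                if ((bitfield[i] & (1 << j)) == 0) {
                    // (i << 3) | j rebuilds the int; it wraps to negative in the upper half.
                    System.out.println((i << 3) | j);
                    return; // one missing integer is all the question asks for
                }
            }
        }
    }
}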
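
To experiment with the one-bit-per-value idea of Case 1 without a 16 GB file, it can be tried on a small universe. The following is only a sketch under stated assumptions, not the poster's code: the class `BitmapSketch`, the method `firstMissing`, and the `bits` parameter (the width of the universe) are hypothetical names, and an in-memory array stands in for the input file.

```java
final class BitmapSketch {
    /**
     * Returns the smallest value in [0, 2^bits) that does not occur in values,
     * or -1 if every value in that range occurs. Assumes 1 <= bits <= 30 and
     * that every element of values lies in [0, 2^bits).
     */
    static int firstMissing(int[] values, int bits) {
        int universe = 1 << bits;
        // One bit per possible value, packed 64 to a long.
        long[] seen = new long[(universe + 63) / 64];
        for (int v : values) {
            seen[v >>> 6] |= 1L << (v & 63);
        }
        for (int candidate = 0; candidate < universe; candidate++) {
            if ((seen[candidate >>> 6] & (1L << (candidate & 63))) == 0) {
                return candidate;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] sample = {0, 1, 2, 4, 5, 7};
        System.out.println(firstMissing(sample, 3)); // prints 3
    }
}
```

Packing the bits into a `long[]` rather than a `byte[]` is just a convenience here; the memory cost per value is the same one bit.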

Case 2: 10 MB memory = 10 * 10⁶ * 8 bits = 80 million bits

Solution:
For all possible 16-bit prefixes there are 2¹⁶ = 65536 values, so we build 65536 buckets. Each bucket needs a 4-byte counter, because in the worst case all 4 billion integers fall into the same bucket; in total that is 2¹⁶ * 4 * 8 ≈ 2 million bits.
1. Build the counter of each bucket in a first pass through the file.
2. Scan the buckets and find the first one with fewer than 65536 hits. One must exist, because 65536 full buckets would require at least 2³² ≈ 4.3 billion integers.
3. In a second pass through the file, build new buckets for the integers whose high 16 bits equal the prefix found in step 2, one bucket per low 16-bit value.
4. Scan the buckets built in step 3 and find the first one that has no hit.
The code is very similar to the one above.
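
The two passes described for Case 2 might look roughly like the sketch below. It rests on several assumptions: `TwoPassSketch` and `findMissing` are hypothetical names, an `int[]` stands in for the file (each loop over it corresponds to one pass), and `int` counters suffice only because a Java array holds fewer than 2³¹ elements. A real 4-billion-entry file would need counters that are treated as unsigned or widened to `long`.

```java
import java.util.BitSet;

final class TwoPassSketch {
    /**
     * Finds a 32-bit value absent from data, treating each int as an unsigned
     * bit pattern. Because data holds fewer than 2^32 entries, some 16-bit
     * prefix must have fewer than 65536 hits.
     */
    static int findMissing(int[] data) {
        // Pass 1: count how many inputs share each high 16-bit prefix.
        int[] counts = new int[1 << 16];
        for (int v : data) {
            counts[v >>> 16]++;
        }
        // Pick the first prefix whose bucket cannot be full.
        int prefix = 0;
        while (counts[prefix] >= (1 << 16)) {
            prefix++;
        }
        // Pass 2: mark which low 16-bit suffixes occur under that prefix.
        BitSet seen = new BitSet(1 << 16);
        for (int v : data) {
            if ((v >>> 16) == prefix) {
                seen.set(v & 0xFFFF);
            }
        }
        // Fewer than 65536 hits means at least one suffix is still clear.
        int suffix = seen.nextClearBit(0);
        return (prefix << 16) | suffix;
    }

    public static void main(String[] args) {
        System.out.println(findMissing(new int[]{0, 1, 2, 3})); // prints 4
    }
}
```

The second pass uses one bit per suffix instead of a counter, since only presence matters at that stage; the working set is about 256 KiB for the counters plus 8 KiB for the bit set, well under 10 MB.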

Conclusion:
We reduce memory usage by making more passes over the file.

A clarification for those arriving late: The question, as asked, does not say that there is exactly one integer that is not contained in the file, or at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately the comment that introduced it to the comment thread was later deleted by its author, so now it looks like the orphaned replies to it just misunderstood everything. It's very confusing, sorry.

爱本泡沫多脆弱 2024-12-07 18:20:52

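Assuming that "integer" means 32 bits: 10 MB of space is more than enough to count, in one pass through the input file, how many numbers in the file have any given 16-bit prefix, for all possible 16-bit prefixes. At least one of the buckets will have been hit fewer than 2¹⁶ times. Do a second pass to find which of the possible numbers in that bucket are already used.

If it means more than 32 bits, but still of bounded size: do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned; your choice) 32-bit range.

If "integer" means mathematical integer: read through the input once and keep track of the ~~largest number~~ length of the longest number you've ever seen. When you're done, output ~~the maximum plus one~~ a random number that has one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)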
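
The "mathematical integer" case can be sketched as follows. The names `LongestNumberSketch` and `absentNumber` are hypothetical, the input is assumed to be decimal text with numbers separated by non-digit characters, and the result is the fixed value 1 followed by zeros instead of the random number the answer suggests. The input is read one character at a time so that a single huge number never has to fit in memory; only its length is stored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LongestNumberSketch {
    /**
     * Returns a decimal number with one more digit than the longest run of
     * digits in the input, so its magnitude exceeds every number read.
     */
    static String absentNumber(Reader input) {
        long longest = 0;
        long current = 0;
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (c >= '0' && c <= '9') {
                    current++;
                    longest = Math.max(longest, current);
                } else {
                    current = 0; // a sign or separator ends the current number
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // 1 followed by `longest` zeros has longest + 1 digits.
        StringBuilder out = new StringBuilder("1");
        for (long i = 0; i < longest; i++) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(absentNumber(new StringReader("7 -123 45"))); // prints 1000
    }
}
```

Returning a `String` keeps the sketch short; if the longest input number were itself too large for memory, the output digits would have to be streamed to the destination instead.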
