Check whether 1 billion cell-phone numbers contain duplicates

Posted on 2024-12-08 15:43:34

It's an interview question:

There are 1 billion cell-phone numbers, each with 11 digits, stored in random order in a file,
for example 12345678910; the first digit is always 1. Go through these numbers and check whether
any number appears more than once. You only need to report whether a duplicate exists:
return True if one is found, otherwise return False.
Only 10 MB of memory is allowed.

Here is my solution:

Hash all these numbers into 1000 files using hash(num)%1000, then the duplicates should fall into the same file.

After the hashing, I get 1000 small files, each of which contains at most 1 million numbers, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.
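
A minimal sketch of the partitioning step, assuming one number per line. The names `partition_by_hash` and `bucket_sizes`, the bucket file names, and the use of `zlib.crc32` in place of `hash(num)` are illustrative choices, not part of the question. Note that 1 billion / 1000 is only the average bucket size under an even hash; `bucket_sizes` measures the actual sizes.

```python
import os
import tempfile
import zlib


def partition_by_hash(numbers, out_dir, buckets=1000):
    """Append each number to the bucket file chosen by a hash of its digits."""
    # Keeping every bucket open is simple but needs `buckets` file handles;
    # lower `buckets` if the OS limit on open files is smaller.
    files = [open(os.path.join(out_dir, f"bucket_{i:04d}.txt"), "w")
             for i in range(buckets)]
    try:
        for num in numbers:
            # crc32 is a stand-in for hash(num); equal numbers always land
            # in the same bucket, whichever deterministic hash is used.
            idx = zlib.crc32(num.encode("ascii")) % buckets
            files[idx].write(num + "\n")
    finally:
        for f in files:
            f.close()


def bucket_sizes(out_dir):
    """Return the line count of every bucket file, to check how even the split is."""
    sizes = []
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as f:
            sizes.append(sum(1 for _ in f))
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        partition_by_hash(["13800000000", "13900000001", "13800000000"], d, buckets=8)
        print(bucket_sizes(d))
```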

Then for each file, build a hash table to store each number and a flag representing its occurrence.

I guess it takes 5 B to represent a number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually 1 bit would suffice for the flag, because I only need to find out whether a duplicate exists, not how many times a number occurs. But how can I attach a 1-bit flag to each number? I'm stumped, so I chose a bool for the flag, which takes 1 B.
So in the end, each number in the hash table takes 5 B (number) + 1 B (flag) + 4 B (next-pointer) = 10 B, and each file needs 10 MB for its hash table.
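
The byte counts above can be checked with a short sketch. It packs the 10 digits after the fixed leading 1 into a single 5-byte integer, which is a simpler variant of the 4 B + 1 B split described in the question; `pack_phone`, `unpack_phone` and the constants are hypothetical names used only for illustration.

```python
def pack_phone(num: str) -> bytes:
    """Pack an 11-digit number with a fixed leading '1' into 5 bytes."""
    if len(num) != 11 or num[0] != "1" or not num.isdigit():
        raise ValueError("expected 11 digits starting with 1")
    # The 10 variable digits are below 10**10 < 2**40, so 5 bytes suffice.
    return int(num[1:]).to_bytes(5, "big")


def unpack_phone(raw: bytes) -> str:
    """Inverse of pack_phone."""
    return "1" + str(int.from_bytes(raw, "big")).zfill(10)


ENTRY_BYTES = 5 + 1 + 4            # number + bool flag + next-pointer
ENTRIES_PER_FILE = 10**9 // 1000   # average, assuming a perfectly even hash
TABLE_BYTES = ENTRY_BYTES * ENTRIES_PER_FILE   # 10_000_000 bytes

if __name__ == "__main__":
    print(len(pack_phone("12345678910")), TABLE_BYTES)
```

Under these assumptions the entries alone already use the whole 10 MB budget, before counting the bucket array of the hash table.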

That's my naive solution. Please give me a better one.

Thanks.

FOLLOW UP:

If there are no duplicates among these 1 billion phone numbers, then given one
phone number, how do you find out whether it is among them?
Use as little memory as possible.

I came up with 2 solutions:

1. A phone number can be represented in 5 B, as I said above. Scan through the file, reading one number at a time, and XOR the given number with the one read from the file; if the result is 0, the given number is in the file. This takes O(n) time, right?
2. Partition the numbers into 2 smaller files according to the leading bit: numbers with a leading 1-bit go to one file and those with a leading 0-bit go to another, while counting how many numbers land in each file. If the given number falls into the 1-bit file and that file is not full, partition the 1-bit file again by the next leading bit and check the given number recursively; if the 1-bit file is full, the given number must be in it. This takes O(log n) time, right?
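
The second idea can be sketched as follows, with an in-memory list standing in for each partition file. The function name `contains` and the `bits` parameter are assumptions for illustration; for the 10 digits after the leading 1, `bits` would be 34, since 2**34 > 10**10.

```python
def contains(numbers, target, bits):
    """Membership test on distinct `bits`-bit integers, halving on one bit per round."""
    current = list(numbers)            # stand-in for the current partition file
    for bit in range(bits - 1, -1, -1):
        want = (target >> bit) & 1
        # One sequential pass: keep only the half that could hold `target`.
        current = [n for n in current if (n >> bit) & 1 == want]
        if not current:
            return False               # the target's half is empty
        if len(current) == 1 << bit:
            return True                # half is full: every value with this prefix is present
    return bool(current)


if __name__ == "__main__":
    print(contains([0, 1, 2, 3, 5], 2, bits=3), contains([0, 1, 2, 3, 5], 4, bits=3))
```

The number of rounds is bounded by the number of bits, but each round still reads the surviving partition, and the first round reads all n numbers, so the total work is at least linear in n rather than O(log n).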

酒浓于脸红 2024-12-15 15:43:34

Fastest solution (also in terms of programmer overhead :)

# Generate some 'phones' (this pipeline never ends on its own; interrupt it or append `| head -n N`)
yes 1 | perl -wne 'chomp; ++$a; print $_."$a\n";' > phones.txt

# Split phones.txt into 10 MB chunks
split -C 10000000 phones.txt

# Sort each 10 MB chunk with 10 MB of memory
for i in x??; do sort -S 10M "$i" > "$i.srt"; echo -ne "$i.srt\0" >> merge.txt; done

# Merge the sorted chunks with 10 MB of memory
sort -S 10M --files0-from=merge.txt -m > sorted.txt

# See if there are any duplicates (exit status 0 means none were found)
test -z "$(uniq -d sorted.txt)"

Check that the memory usage constraint is met with, for example, pmap $(pidof sort).
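
The same sort-then-compare idea can be sketched in Python, assuming the chunks have already been sorted individually (for instance by the shell commands above). `has_adjacent_duplicate` and `has_duplicate_external` are hypothetical helper names; `heapq.merge` is the standard-library k-way merge and keeps only one pending item per chunk in memory.

```python
import heapq


def has_adjacent_duplicate(sorted_items):
    """Return True if a sorted stream contains two equal neighbours."""
    previous = None
    for item in sorted_items:
        if item == previous:
            return True
        previous = item
    return False


def has_duplicate_external(chunks):
    """`chunks` are individually sorted iterables, e.g. open sorted chunk files."""
    return has_adjacent_duplicate(heapq.merge(*chunks))


if __name__ == "__main__":
    print(has_duplicate_external([["11", "13"], ["12", "13"]]))
```

After a merge, equal numbers are always neighbours, so a single pass comparing each item with the previous one is enough.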

七婞 2024-12-15 15:43:34

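"After the hashing, I got 1000 small files, each of which contains 1 million numbers at most, right"

Not true. In the extreme case, a single file could contain all the numbers.

Create the files based on the first or last x digits of the numbers (ignoring the leading 1). When creating those files you can drop those x digits, because they are identical within a file. This is much better than hashing: all the numbers can still end up in one file, but the range of values in each file is now bounded, so a file can be processed within 10 MB.

Each number can be represented by a single bit, because the only information you need is whether the number has occurred before. You don't have to store the actual numbers; the address of the bit is the number. 10 MB holds 80M bits, and the 10 digits after the leading 1 span 10^10 values, so you need at least 10^10 / 80M = 125 files. Because the files are split on whole digits, x = 2 (100 files of 10^8 values, 12.5 MB of bits each) is slightly too large, so in practice you need 1000 files (x = 3), each with a 10^7-bit (1.25 MB) bitmap.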
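
A minimal sketch of the one-bit-per-number check for a single prefix file. It assumes a 3-digit prefix has already been chopped off, so 7 digits remain and the bitmap needs 10**7 bits (1.25 MB); the `Bitmap` class and `file_has_duplicate` are illustrative names, not taken from the answer.

```python
class Bitmap:
    """Fixed-size bit array backed by a bytearray."""

    def __init__(self, size_bits):
        self.bits = bytearray((size_bits + 7) // 8)

    def test_and_set(self, index):
        """Set bit `index`; return True if it was already set."""
        byte, mask = index >> 3, 1 << (index & 7)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen


def file_has_duplicate(suffixes, suffix_digits=7):
    """`suffixes` are the digit strings left after chopping the shared prefix."""
    bitmap = Bitmap(10 ** suffix_digits)
    # any() stops at the first suffix whose bit was already set.
    return any(bitmap.test_and_set(int(s)) for s in suffixes)


if __name__ == "__main__":
    print(file_has_duplicate(["0000001", "0000002", "0000001"]))
```

The bit index is the number itself, so nothing but the bitmap has to be kept in memory while the file is streamed.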

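Finally, you don't have to create those files at all; you can scan the whole input file multiple times instead. In that case you can keep several bitmaps in memory at once, since one bitmap doesn't occupy the full 10 MB.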
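
The multi-pass variant might look like the sketch below. It assumes 3-digit prefixes with 1.25 MB bitmaps and handles 4 prefixes per pass (about 5 MB of bitmaps); `has_duplicate_multipass`, `read_numbers` and `prefixes_per_pass` are hypothetical names, and `read_numbers` must return a fresh iterator over the 11-digit strings on every call.

```python
def has_duplicate_multipass(read_numbers, prefix_digits=3, prefixes_per_pass=4):
    """Scan the input repeatedly, checking a few prefixes per pass with bitmaps."""
    suffix_digits = 10 - prefix_digits          # digits left after the leading 1 and the prefix
    bitmap_bytes = (10 ** suffix_digits + 7) // 8
    prefix_count = 10 ** prefix_digits
    for start in range(0, prefix_count, prefixes_per_pass):
        stop = min(start + prefixes_per_pass, prefix_count)
        # One bitmap per prefix handled in this pass.
        bitmaps = {p: bytearray(bitmap_bytes) for p in range(start, stop)}
        for num in read_numbers():
            bitmap = bitmaps.get(int(num[1:1 + prefix_digits]))
            if bitmap is None:
                continue                        # this number belongs to another pass
            idx = int(num[1 + prefix_digits:])
            byte, mask = idx >> 3, 1 << (idx & 7)
            if bitmap[byte] & mask:
                return True
            bitmap[byte] |= mask
    return False


if __name__ == "__main__":
    nums = ["13800000000", "13900000001", "13800000000"]
    print(has_duplicate_multipass(lambda: iter(nums)))
```

This trades disk space for time: no intermediate files are written, but the input is read once per pass.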

I strongly suggest reading this book, which starts with an almost identical example: http://www.amazon.co.uk/Programming-Pearls-ACM-Press-Bentley/dp/0201657880

原来分手还会想你 2024-12-15 15:43:34

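No need for hashing. 10 MB = 83886080 bits, so put each number into one of the ranges [0, 83886080), [83886080, 83886080 * 2), ..., [xx, 9999999999] (ignoring the first digit). That is 9999999999 / 83886080 ≈ 119.2, i.e. 120 files. Then build a bit set for each file; it takes O(n) time in total.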
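
The range arithmetic can be checked with a few lines. This is only a sketch: `locate` and the constant names are illustrative, and it treats 10 MB as 10 * 1024 * 1024 bytes, as the figure 83886080 implies.

```python
BITS_PER_RANGE = 10 * 1024 * 1024 * 8    # 83_886_080 bits
VALUE_SPACE = 10 ** 10                    # 10 digits after the fixed leading 1

# Ceiling division: number of ranges needed to cover the whole value space.
RANGE_COUNT = -(-VALUE_SPACE // BITS_PER_RANGE)


def locate(num: str):
    """Map an 11-digit number to (range index, bit offset inside that range)."""
    value = int(num[1:])                  # drop the leading 1
    return divmod(value, BITS_PER_RANGE)


if __name__ == "__main__":
    print(RANGE_COUNT, locate("19999999999"))
```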

可是我不能没有你 2024-12-15 15:43:34

You can use the bitset technique. See this question and its answers: Find an integer not among four billion given ones

怪我入戏太深 2024-12-15 15:43:34

The interview question imposes a limit only on the memory used, not on the time it takes to produce an answer.

It is thus reasonable to solve the problem like this:

take the first number
compare it to all numbers following it
take the second number
compare it to all numbers following it
...

This takes an enormous amount of time for a billion numbers (O(n^2)), but it needs no more than 10 MB of memory.
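
A runnable sketch of those steps, assuming one number per line in the file. `has_duplicate_quadratic` is a hypothetical name; the file is opened twice so that the inner scan can restart right after the current line.

```python
import os
import tempfile


def has_duplicate_quadratic(path):
    """Compare every line with all lines after it: O(n^2) time, constant memory."""
    with open(path, "rb") as outer, open(path, "rb") as inner:
        while True:
            current = outer.readline()
            if not current:
                return False              # reached end of file without a match
            inner.seek(outer.tell())      # start right after the current line
            for later in inner:
                if later.strip() == current.strip():
                    return True


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("13800000000\n13900000001\n13800000000\n")
    print(has_duplicate_quadratic(path))
    os.remove(path)
```

Only two lines are held in memory at any time, which is what keeps the approach inside the limit despite its running time.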

眼波传意 2024-12-15 15:43:34

You can use a Bloom filter with an m-bit array and k hash functions.
I'm not sure how many hash functions you would need, though.
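
As a rough guide to the open question of k, the sketch below applies the standard Bloom filter approximations, k ≈ (m / n) ln 2 and false-positive rate ≈ (1 - e^(-kn/m))^k, to this problem's sizes. `bloom_estimates` is a hypothetical helper, and m is taken to be the full 10 MB of bits, which is an assumption since the answer does not fix m.

```python
import math


def bloom_estimates(m_bits, n_items):
    """Return (suggested k, approximate false-positive rate at that k)."""
    k = max(1, round(m_bits / n_items * math.log(2)))
    fp_rate = (1 - math.exp(-k * n_items / m_bits)) ** k
    return k, fp_rate


if __name__ == "__main__":
    k, fp = bloom_estimates(m_bits=10 * 1024 * 1024 * 8, n_items=10 ** 9)
    print(k, fp)
```

With less than one bit available per number, the estimate suggests a single hash function and a false-positive rate very close to 1, so within 10 MB the filter would report almost every number as already seen. A Bloom filter can also only say "possibly seen", so any positive would still need an exact check.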

About the author

不再让梦枯萎
