Checking whether 1 billion cell-phone numbers contain duplicates
It's an interview question:
There are 1 billion cell-phone numbers, each with 11 digits, stored in random order in a file, for example 12345678910; the first digit is always 1. Go through these numbers to see whether any of them is duplicated. You only need to determine whether a duplicate exists: return True if one is found, otherwise return False.
Only 10 MB of memory is allowed.
Here is my solution:
Hash all these numbers into 1000 files using hash(num) % 1000, so that duplicates fall into the same file.
After the hashing, I have 1000 small files, each of which contains at most 1 million numbers, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.
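The partitioning step described above could be sketched roughly as follows. This assumes the numbers arrive as 11-character strings; the function name partition_by_hash, the bucket file names, and the use of zlib.crc32 as the hash are illustrative assumptions, not part of the question.

```python
import os
import zlib


def partition_by_hash(numbers, out_dir, buckets=1000):
    """Write each number to the bucket file selected by a stable hash.

    Equal numbers always land in the same bucket, but nothing guarantees
    that the buckets end up the same size.
    """
    counts = [0] * buckets
    handles = {}
    try:
        for num in numbers:
            # crc32 is used here only because it is stable across runs.
            b = zlib.crc32(num.encode("ascii")) % buckets
            if b not in handles:
                path = os.path.join(out_dir, f"bucket_{b:04d}.txt")
                handles[b] = open(path, "w")
            handles[b].write(num + "\n")
            counts[b] += 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts
```

The returned counts make it easy to check how uneven the buckets are. Note that keeping up to 1000 files open at once may run into the per-process file-descriptor limit on some systems.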
Then, for each file, build a hash table that stores each number and a flag representing its occurrence.
I guess it will take 5 B to represent the number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually, 1 bit would suffice for the flag, because I only need to find out whether a duplicate exists, not how many times it occurs. But how can I attach a 1-bit flag to each number? I'm stumped, so I chose a bool as the flag, which takes 1 B.
So finally, each number in the hash table takes 5 B (number) + 1 B (flag) + 4 B (next pointer) = 10 B, and the hash table for each file takes about 10 MB.
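As a rough illustration of the 5-byte layout above, here is a minimal sketch. It also shows one possible way to carry the 1-bit flag without an extra byte: the upper 3 digits are always 100 to 199, so storing them minus 100 needs only 7 bits and leaves the top bit of that byte free. The names pack_number and unpack_number and the bit placement are assumptions made for this example.

```python
def pack_number(phone, seen=False):
    """Pack an 11-digit number (plus a 1-bit flag) into 5 bytes."""
    if len(phone) != 11 or not phone.isdigit() or phone[0] != "1":
        raise ValueError("expected 11 digits starting with 1")
    upper = int(phone[:3]) - 100          # 0..99, fits in 7 bits
    lower = int(phone[3:])                # 0..99_999_999, fits in 4 bytes
    first = upper | (0x80 if seen else 0)  # top bit carries the flag
    return bytes([first]) + lower.to_bytes(4, "big")


def unpack_number(packed):
    """Inverse of pack_number: returns (phone, seen)."""
    seen = bool(packed[0] & 0x80)
    upper = (packed[0] & 0x7F) + 100
    lower = int.from_bytes(packed[1:5], "big")
    return f"{upper}{lower:08d}", seen
```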
That's my stupid solution. Please give me a better one.
Thanks.
FOLLOW UP:
If there are no duplicates in these 1 billion phone numbers, then given one phone number, how do you find out whether it is among these 1 billion numbers? Use as little memory as possible.
I came up with 2 solutions:
1. The phone number can be represented using 5 B, as I said above. Scan through the file, reading one number at a time, and XOR the given number with the one read from the file; if the result is 0, the given number is in the file. It'll take O(n) time, right?
2. Partition these numbers into 2 small files according to the leading bit: numbers with a leading 1-bit go to one file, and those with a leading 0-bit go to another. Meanwhile, count how many numbers are in each file. If the given number falls into the 1-bit file and the 1-bit file's count is not full, partition the 1-bit file again according to the second leading bit and check the given number recursively; if the 1-bit file is full, the given number must be in the file. It'll take O(log n) time, right?
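The second idea can be sketched as below. For brevity an in-memory list stands in for the files, the numbers are treated as integers (the 10 digits after the leading 1 fit in 34 bits), and the function name contains is hypothetical. The number of rounds is bounded by the bit width, but each round still reads its whole current input, so the total work is not logarithmic in n.

```python
def contains(numbers, target, bits=34):
    """Membership test by repeatedly partitioning on the next leading bit.

    Assumes `numbers` holds distinct non-negative integers below 2**bits.
    """
    candidates = list(numbers)  # stands in for the current file
    for bit in range(bits - 1, -1, -1):
        want = (target >> bit) & 1
        # Keep only the half whose current bit matches the target's bit.
        candidates = [n for n in candidates if ((n >> bit) & 1) == want]
        if not candidates:
            return False  # empty half: the target cannot be present
        if len(candidates) == (1 << bit):
            # The half is "full": with no duplicates, every value sharing
            # this prefix must be present, including the target.
            return True
    return bool(candidates)  # only reached when bits == 0
```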
Comments (6)
Fastest solution (also in terms of programmer overhead :)
Check that the memory usage constraint is met with pmap $(pidof sort) for example:
Not true: in the extreme case, it's possible that one file contains all the numbers.
Create the files based on the first or last x digits of the numbers (ignoring the leading 1). When creating those files you can actually chop off those digits, because they are equal within a file. This is a lot better than hashing because, although all the numbers can still end up in one file, the range of those numbers is now limited, so you can fit it into 10 MB.
Each number can be represented by a single bit, because the only information you need is whether the number occurred previously. You don't have to store the actual numbers; the address of the bit is the number. In 10 MB you can store about 80M bits. The 10 digits after the leading 1 span 10^10 possible values, so you need at least 10^10 / 80M = 125 files; since the split is on whole digits, that means 1000 files (x=3), each covering 10^7 values in a bitmap of about 1.25 MB.
Finally, you don't have to create those files; you can also scan the whole file multiple times. In this case you can have multiple bitmaps in memory, as one doesn't occupy 10 MB.
I strongly suggest reading this book; it starts with an almost identical example: http://www.amazon.co.uk/Programming-Pearls-ACM-Press-Bentley/dp/0201657880
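A minimal sketch of the multi-pass bitmap variant, under some assumptions: the numbers are 11-character strings, read_numbers is a hypothetical callable that returns a fresh iterator on every call (standing in for re-reading the file), and the prefix length is three digits, since the 7 remaining digits then need 10^7 bits, about 1.25 MB, per bitmap. It keeps one bitmap per pass for simplicity, although several would fit in the budget.

```python
def has_duplicate(read_numbers, prefix_digits=3):
    """Return True if any 11-digit number occurs more than once.

    The input is scanned once per prefix; within a pass, each number with
    that prefix is identified solely by the position of its bit.
    Assumes 1 <= prefix_digits <= 9.
    """
    suffix_digits = 10 - prefix_digits
    bitmap_bytes = (10 ** suffix_digits + 7) // 8
    for prefix in range(10 ** prefix_digits):
        wanted = f"1{prefix:0{prefix_digits}d}"
        bitmap = bytearray(bitmap_bytes)  # all bits start cleared
        for num in read_numbers():
            if not num.startswith(wanted):
                continue
            idx = int(num[1 + prefix_digits:])  # remaining digits = bit address
            byte, mask = idx >> 3, 1 << (idx & 7)
            if bitmap[byte] & mask:
                return True  # bit already set: this number was seen before
            bitmap[byte] |= mask
    return False
```

Trading passes for memory this way avoids temporary files entirely, at the cost of reading the input once per prefix.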
No need for hashing. 10 MB = 83886080 bits, so put each number into one of the ranges [0, 83886080), [83886080, 83886080 * 2), ..., [xx, 9999999999] (ignoring the first digit). That is about 9999999999 / 83886080 ≈ 120 files. Then build a bit set for each file; it takes O(n) time in total.
You can follow the bitset technique. Refer to this question and its answers: Find an integer not among four billion given ones
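The range arithmetic from the first of these comments can be checked with a few lines. The constant and function names are illustrative assumptions; 10 MB is taken as 10 × 2^20 bytes, matching the 83886080 figure.

```python
BITS_PER_BLOCK = 10 * 1024 * 1024 * 8  # 10 MB expressed in bits (83886080)
VALUE_RANGE = 10 ** 10                 # 10 free digits after the leading 1


def block_count(value_range=VALUE_RANGE, bits=BITS_PER_BLOCK):
    """Number of fixed-size ranges needed to cover the value range."""
    return -(-value_range // bits)  # ceiling division


def block_of(phone, bits=BITS_PER_BLOCK):
    """Map an 11-digit number to (range index, bit offset inside that range)."""
    value = int(phone[1:])  # drop the fixed leading 1
    return value // bits, value % bits


print(block_count())  # 120
```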
The interview question imposes a limit only on the memory used, not on the time it takes to provide an answer.
It is thus reasonable to implement this question like this:
This takes an enormous amount of time to process the billion numbers (O(n^2)), but does not take more than 10 MB of memory.
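A minimal sketch of what such a quadratic, constant-memory check might look like, assuming a text file with one number per line. The function name and file layout are assumptions for illustration, not the code from this comment.

```python
def has_duplicate_quadratic(path):
    """Compare every line against all later lines, holding two lines at a time."""
    with open(path) as outer:
        for i, line in enumerate(outer):
            current = line.strip()
            # Re-read the file from the start for each candidate number.
            with open(path) as inner:
                for j, other in enumerate(inner):
                    if j > i and other.strip() == current:
                        return True
    return False
```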
You can use a Bloom filter, which consists of an m-bit array and uses k hash functions.
Though I am not sure how many hash functions you would need.
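On the question of how many hash functions: the usual approximations are k ≈ (m/n) ln 2 for the optimal count and (1 - e^(-kn/m))^k for the false-positive rate. The sketch below plugs in the numbers from this problem, assuming the whole 10 MB budget goes to the bit array; the function names are made up for the example. With so few bits per item the filter would be almost completely saturated, and since a Bloom filter can report false positives, it could not confirm a duplicate on its own.

```python
import math


def optimal_hash_count(m_bits, n_items):
    """Approximate optimal number of hash functions: (m / n) * ln 2."""
    return (m_bits / n_items) * math.log(2)


def false_positive_rate(m_bits, n_items, k):
    """Approximate false-positive rate after inserting n_items."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k


m = 10 * 1024 * 1024 * 8  # 10 MB of bits
n = 10 ** 9               # one billion numbers
print(optimal_hash_count(m, n))      # well below 1
print(false_positive_rate(m, n, 1))  # very close to 1
```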