Check 1 billion cell-phone numbers for duplicates

Published 2024-12-08 15:43:34


It's an interview question:

There are 1 billion cell-phone numbers, each with 11 digits, stored randomly in a file; for example 12345678910, where the first digit must be 1. Go through these numbers and check whether any of them is duplicated: just whether a duplicate exists. If a duplicate is found, return True; otherwise return False.
Only 10 MB of memory is allowed.

Here is my solution:

Hash all these numbers into 1000 files using hash(num)%1000, then the duplicates should fall into the same file.

After the hashing, I got 1000 small files, each of which contains 1 million numbers at most, right? I'm not sure about this; I simply computed 1 billion / 1000 = 1 million.

Then for each file, build a hash table to store each number and a flag representing its occurrence.

I guess it will take 5 B to represent the number: 4 B for the lower 8 digits and 1 B for the upper 3 digits. Actually 1 bit would suffice for the flag, because I just need to find out whether a duplicate exists, not how many times it occurs. But how can I attach a 1-bit flag to each number? I'm stuck, so I chose bool as the flag, which takes 1 B.
So finally, each number in the hash table will take 5 B (number) + 1 B (flag) + 4 B (next pointer) = 10 B, and each file's hash table will take at most 10 MB.
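A minimal in-memory sketch of this partition-then-check idea (file I/O is replaced by Python lists; the bucket count and helper names are illustrative, not from the original post):

```python
N_BUCKETS = 1000

def pack(number: str) -> int:
    """Drop the constant leading 1; the remaining 10 digits
    fit in 34 bits, i.e. the 5 bytes mentioned above."""
    return int(number[1:])

def has_duplicate(numbers) -> bool:
    # First pass: partition into buckets (stand-ins for the 1000 files).
    buckets = [[] for _ in range(N_BUCKETS)]
    for num in numbers:
        v = pack(num)
        buckets[v % N_BUCKETS].append(v)
    # Second pass: any duplicate must land in a single bucket,
    # so each bucket can be checked independently.
    for bucket in buckets:
        seen = set()
        for v in bucket:
            if v in seen:
                return True
            seen.add(v)
    return False
```

With real 10 MB-bounded files, each bucket would be written to disk and checked one at a time, so only one bucket's table is in memory at once.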

That's my stupid solution; please give me a better one.

Thanks.

FOLLOW UP:

If there are no duplicates in these 1 billion phone numbers, then given one phone number, how do I find out whether it is among the 1 billion numbers? Use as little memory as possible.

I came up with 2 solutions,

  1. The phone number can be represented using 5 B as I said above. Scan through the file, read one number at a time, and XOR the given number with the one read from the file; if the result is 0, then the given one is in the file. It'll take O(n) time, right?

  2. Partition these numbers into 2 smaller files according to the leading bit: numbers with a leading 1-bit go to one file, those with a leading 0-bit go to another, and meanwhile count how many numbers land in each file. If the given number falls into the 1-bit file and that file's count is not full, partition the 1-bit file again according to the next leading bit and check the given number recursively; if the 1-bit file is full, then the given number must be in the file. It'll take O(log n) time, right?
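Solution 2's partition-and-count idea can be sketched in memory like this (the range bookkeeping and function name are my own; on-disk files would replace the lists):

```python
def contains(nums, target, lo, hi):
    """Is target among nums, all of which lie in [lo, hi) with no
    duplicates? If a sub-range is 'full' (count equals range size),
    every value in that range must be present."""
    count = len(nums)
    if count == 0:
        return False
    if count == hi - lo:          # range saturated: target must be here
        return True
    mid = (lo + hi) // 2          # split on the next leading bit
    if target < mid:
        return contains([n for n in nums if n < mid], target, lo, mid)
    return contains([n for n in nums if n >= mid], target, mid, hi)
```

One caveat: the recursion does O(n) work at the top level, O(n/2) at the next, and so on, so the total work is O(n); the O(log n) figure counts only the number of partition rounds, not the reads each round performs.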

Comments (6)

酒浓于脸红 2024-12-15 15:43:34


Fastest solution (also in terms of programmer overhead :)

# Generate some 'phones'
yes 1 | perl -wne 'chomp; ++$a; print $_."$a\n";' > phones.txt

# Split phones.txt in 10MB chunks
split -C 10000000 phones.txt

# Sort each 10MB chunk with 10MB of memory
for i in x??; do sort -S 10M $i > $i.srt; echo -ne "$i.srt\0" >> merge.txt; done

# Merge the sorted chunks with 10MB of memory
sort -S 10M --files0-from=merge.txt -m > sorted.txt

# See if there are any duplicates
test -z "$(uniq -d sorted.txt)"

Check that the memory usage constraint is met with pmap $(pidof sort), for example.

七婞 2024-12-15 15:43:34


After the hashing, I got 1000 small files, each of which contains 1
million numbers at most, right

Not true; in the extreme case it's possible that one file contains all the numbers.

Create the files based on the first or last x digits of the numbers (ignore the starting 1). When creating those files you can actually chop those digits because they are equal within a file. This is a lot better than hashing because although all the numbers can still end up in one file, now the range of those numbers is limited, so you can fit it into 10MB.

Each number can be represented by a single bit, because the only information you need is whether the number occurred previously. You don't have to store the actual numbers; the address of the bit is the number. In 10MB you can store 80M bits, so you would need 1G/80M = 12.5 files, but remember, those chopped digits must differ between files, so actually you will need 100 files (x=2).

Finally, you don't have to create those files, you can also scan the whole file multiple times. In this case you can have multiple bit-maps in memory as one doesn't occupy 10MB.
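The multi-pass bitmap idea above can be sketched as follows (the pass width and function names are illustrative; `read_numbers` stands in for re-reading the file on each pass):

```python
def range_has_duplicate(numbers, lo, n_bits):
    """One pass over the stream, checking only numbers in
    [lo, lo + n_bits) against a bitmap of n_bits bits."""
    seen = bytearray(n_bits // 8)
    for num in numbers:
        if lo <= num < lo + n_bits:
            idx = num - lo
            byte, mask = idx // 8, 1 << (idx % 8)
            if seen[byte] & mask:       # bit already set: duplicate
                return True
            seen[byte] |= mask
    return False

def any_duplicate(read_numbers, total_range, n_bits):
    # One full scan of the input per bitmap-sized chunk of the range.
    for lo in range(0, total_range, n_bits):
        if range_has_duplicate(read_numbers(), lo, n_bits):
            return True
    return False
```

With a 10 MB bitmap (about 80M bits) covering the 10^10 possible suffixes, this comes to roughly 120 passes over the file.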

I strongly suggest reading this book, it starts with an almost identical example: http://www.amazon.co.uk/Programming-Pearls-ACM-Press-Bentley/dp/0201657880

原来分手还会想你 2024-12-15 15:43:34


No need for a hash. 10 MB = 83886080 bits; put each number into [0, 83886080), [83886080, 83886080 * 2), ..., [xx, 9999999999) (don't consider the first digit), about 9999999999 / 83886080 ≈ 120 files, then build the bit set for each range; it takes O(n) time in total.

可是我不能没有你 2024-12-15 15:43:34


You can follow the bitset technique. Refer to this question and its answers: Find an integer not among four billion given ones

怪我入戏太深 2024-12-15 15:43:34


The interview question imposes only a limit on the memory used, not on the time it takes to provide an answer.

It is thus reasonable to implement the solution like this:

take the first number
compare it to all numbers following it
take the second number
compare it to all numbers following it
...

This takes an enormous amount of time to process the billion numbers (O(n^2)), but does not use more than 10MB of memory.
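The pseudocode above can be made concrete like this (a sketch under the assumption that the file holds one number per line; the function name is my own):

```python
def has_duplicate_quadratic(path):
    """O(n^2) time, O(1) memory: for each number, rescan the file and
    compare it against every number that follows it."""
    with open(path) as outer:
        pos = 0
        for line in outer:
            a = line.strip()
            pos += 1                      # index just past this line
            with open(path) as inner:
                for j, other in enumerate(inner):
                    if j >= pos and other.strip() == a:
                        return True
    return False
```

Each of the n outer numbers triggers a fresh scan of the file, which is where the quadratic running time comes from.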

眼波传意 2024-12-15 15:43:34


You can use a Bloom filter, which consists of an m-bit array and uses k hash functions.
Though I am not sure how many hash functions you may need.
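A minimal Bloom filter sketch (the salted-SHA-1 scheme is just one common way to derive the k hash functions; the class name and parameters are illustrative):

```python
import hashlib

class BloomFilter:
    """m-bit array plus k hash functions; false positives are
    possible, false negatives are not."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _indexes(self, item):
        # Derive k indexes by hashing the item with k different salts.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

Note that a Bloom filter can report false positives, so for the exact duplicate-detection question it only works as a pre-filter: numbers it flags would still need verification by another pass.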
