Sorting and searching big data
I have two data files with 100-character lines each. File A has 10^8 lines and file B has 10^6 lines. I need to find all the strings in file B that are not in file A.
At first I thought of feeding both files into MySQL, but it looks like it will never finish creating a unique key on 10^8 records.
I'm waiting for your suggestions on this.
Comments (3)
You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do it:
Calculate 64-bit hashes for the strings in the B file using a decent hash function. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and check whether each one matches a hash from your B file. Any line of A whose hash matches one from B should be stored in a file C.
When this process is complete, file C will hold the small subset of A that potentially matches strings in B. You now have a much smaller file C to compare the lines of B against. This reduces the problem to one where you can load all of the lines from C into memory (as a hash table) and check each line of B to see whether it is in C; the lines of B that are not in C are the ones missing from A.
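The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names (`hash64`, `read_lines`, `b_not_in_a`) and the file paths are hypothetical, and BLAKE2b truncated to 8 bytes stands in for "a decent 64-bit hash function"; the answer does not name a specific hash.

```python
import hashlib


def hash64(line: bytes) -> int:
    # Assumption: BLAKE2b with an 8-byte digest serves as the 64-bit hash.
    return int.from_bytes(hashlib.blake2b(line, digest_size=8).digest(), "big")


def read_lines(path):
    # Stream a file line by line, without the trailing newline.
    with open(path, "rb") as f:
        for raw in f:
            yield raw.rstrip(b"\r\n")


def b_not_in_a(path_a, path_b, path_c):
    # Step 1: keep only the 64-bit hashes of B's lines in memory.
    b_hashes = {hash64(line) for line in read_lines(path_b)}

    # Step 2: stream A once; lines whose hash matches a B hash are
    # candidate matches and are written to the (much smaller) file C.
    with open(path_c, "wb") as c:
        for line in read_lines(path_a):
            if hash64(line) in b_hashes:
                c.write(line + b"\n")

    # Step 3: C fits in memory, so compare B against it exactly.
    candidates = set(read_lines(path_c))
    return [line for line in read_lines(path_b) if line not in candidates]
```

The exact comparison in the last step is what makes hash collisions harmless: a collision can only put an extra line into C, never hide a line of B that is missing from A.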
You can slightly improve on @michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove an item from the hash table of B's elements whenever you find a match for it among the elements of A. The elements that remain in the hash table at the end are the ones that were not found in file A.
For the sizes you mention, you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in Python:
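The snippet from the original answer is not reproduced here. As a minimal sketch of the idea, and not the script that was timed, one might load B's lines into a set and discard each line seen while streaming A. The function name `missing_from_a` and its parameters are hypothetical. With 10^6 lines of 100 characters, B is about 100 MB of raw text, plus Python's per-object overhead.

```python
def missing_from_a(path_a, path_b):
    # Load every line of B into memory (B is the small file).
    with open(path_b, "rb") as fb:
        remaining = {line.rstrip(b"\r\n") for line in fb}

    # Stream A once, dropping each line of B that also occurs in A.
    with open(path_a, "rb") as fa:
        for line in fa:
            remaining.discard(line.rstrip(b"\r\n"))
            if not remaining:
                break  # every line of B was found in A; stop early

    # Whatever is left never appeared in A.
    return remaining
```

Because the result is a set, the original order of B is not preserved and repeated lines in B are reported once.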
I've tested this on two files of size 10^5 and 10^7, with ~8 characters per line, on an Atom processor. Output from /usr/bin/time: