Big data sort and search

Posted on 2024-09-27 11:39:05

I have two data files with 100-character lines each. File A has 10^8 lines and file B has 10^6 lines. I need to find all the strings in file B that are not in file A.
At first I thought of loading both files into MySQL, but it looks like it will never finish creating a unique key on 10^8 records.

I'm waiting for your suggestions on this.

Comments (3)

终止放荡 2024-10-04 11:39:05

You can do this without a database. The key is to reduce the size of A, since A is much larger than B. Here is how:

Compute a 64-bit hash of each string in file B using a decent hash function. Store these hashes in memory (in a hash table), which is feasible because B is small. Then hash every string in file A, line by line, and check whether its hash matches one from B. Write every line of A whose hash matches to a file C.

When this pass is complete, file C holds the small subset of A that could potentially match strings in B. This reduces the problem to one where you can load all the lines of C into memory (as a hash table) and check each line of B to see whether it is in C. The lines of B that are not found in C are the ones missing from A.
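
A minimal sketch of the steps above, in Python. The answer does not name a hash function, so truncating BLAKE2b to 8 bytes is just one assumed way to get a 64-bit hash. The names hash64, lines_of and b_minus_a and the three path arguments are hypothetical, chosen only for illustration.

```python
import hashlib


def hash64(line: str) -> int:
    """Return a 64-bit hash of a line (BLAKE2b truncated to 8 bytes; an assumed choice)."""
    digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def lines_of(path):
    """Yield the lines of a text file one at a time, without the trailing newline."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def b_minus_a(path_a, path_b, path_c):
    """Return the lines of B that do not occur in A, using C as a scratch file."""
    # Step 1: keep only the 64-bit hashes of B's lines in memory.
    b_hashes = {hash64(line) for line in lines_of(path_b)}

    # Step 2: stream A and write every line whose hash matches one from B to C.
    with open(path_c, "w", encoding="utf-8") as c:
        for line in lines_of(path_a):
            if hash64(line) in b_hashes:
                c.write(line + "\n")

    # Step 3: C is small, so load it fully and compare the actual strings.
    candidates = set(lines_of(path_c))
    return [line for line in lines_of(path_b) if line not in candidates]
```

Comparing the real strings in the last step is what makes hash collisions harmless: a collision can only add an extra line to C, it cannot cause a wrong answer.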

晨光如昨 2024-10-04 11:39:05

You can slightly improve on @michael-goldshteyn's answer (https://stackoverflow.com/a/3926745/179529). Since you need to find all the strings in B that are not in A, you can remove an item from the hash table of B's elements whenever you find a match for it among the elements of A. The elements that remain in the hash table are those that were not found in file A.

反差帅 2024-10-04 11:39:05

For the sizes you mention, you should be able to keep all of B in memory at once, so you could use a simplified version of Goldshteyn's answer; something like this in Python:

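#!/usr/bin/python3

import sys

if __name__ == '__main__':
    # Usage: script.py A B  (A is the large file, B the small one).
    # Load every line of B into an in-memory set.
    with open(sys.argv[2], 'r') as b:
        bs = {line.strip() for line in b}
    # Stream A and drop each line of B that also appears in A.
    with open(sys.argv[1], 'r') as a:
        for line in a:
            bs.discard(line.strip())
    # Whatever is left is in B but not in A.
    for x in bs:
        print(x)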

I've tested this on two files of 10^5 and 10^7 lines, with ~8 characters per line, on an Atom processor. Output from /usr/bin/time:

25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
  60298   60298  509244