面试:列出内存有限的交集

发布于 2024-09-11 06:22:56 字数 101 浏览 2 评论 0原文

给定两组整数,大小为 M 和 N,其中 M < N. 对这两个集合执行内等连接(即找到两个列表的交集)。如果两个列表都在文件中并且可用内存的大小为 K < ,则如何执行它M<氮

You are given two sets of integers, sizes M and N with M < N. Perform inner equal join on these two sets (i.e., find intersection of the two lists). How to perform it if both the lists are in files and the available memory is of size K < M < N

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

束缚m 2024-09-18 06:22:56

首先使用就地排序算法对两个列表进行排序。

一旦 M 和 N 都排序完毕,计算交集就像一场竞赛。在每个列表的开头放置两个标记 ab。执行这些步骤,直到任何标记到达列表末尾。用伪代码表示:

until a > M.size or b > N.size
    if M[a] < N[b]
        a++
        continue
    if N[b] < M[a]
        b++
        continue
    print M[a] // common element
    // Advance both indexes to avoid infinite loop
    a++
    b++

给定一个不错的就地排序算法,其复杂度将为O(nlogn)

Using an in-place sorting algorithm to sort both the lists first.

Once both M an N are sorted, then calculating the intersection is like a race. Place two markers a and b at the beginning of each list. Do these steps until any marker reaches the end of the list. In pseudocode:

until a > M.size or b > N.size
    if M[a] < N[b]
        a++
        continue
    if N[b] < M[a]
        b++
        continue
    print M[a] // common element
    // Advance both indexes to avoid infinite loop
    a++
    b++

Given a decent in-place sorting algorithm, the complexity of this will be O(nlogn).

深海夜未眠 2024-09-18 06:22:56

对于 M 的每个 k/2 子部分

 对于 N 的每个 k/2 子部分 
        对 M 子部分进行排序N子部分
        找到两个子部分的交集 
        搜索结果的每个元素,如果还没有则存储它

现在

for each k/2 subpart of M

 for each k/2 subpart of N 
        sort the M-subpart & N-subpart
        find intersection of both the subparts 
        search each element of the result and store it if it already not

present

亚希 2024-09-18 06:22:56
  1. 将 N 分成 n 个部分 L (L < K)
  2. 一次读取 M 个数字,将该数字与当前部分中的所有数字进行比较 L
  3. 如果发现匹配,则将其写入文件
  1. Split N into n parts L (L < K)
  2. Read M number at a time, compare that number with all the numbers in current part L
  3. If you find the match write it to file
不羁少年 2024-09-18 06:22:56

取 M 的 K/2 个元素和 N 的 K/2 个元素。
对 M 子集和 N 子集进行排序。
现在交叉点是微不足道的。
写入交集,删除交集的元素,写回左侧元素。
继续 M 和 N 的所有其他 K/2 子部分。
现在,您在第三个文件中拥有一些常见元素,以及一些部分排序的列表。
现在,对于 M 和 N 列表中的每个 K / 2(减去删除的元素),比较它们以有效地找到交集。

(您还可以添加额外的 2 M 子集和 N 子集合并轮次以加快交集速度)。

万岁!


执行示例:

Ok let's take 2 lists of 9 integers with values between 0 and 9.
These lists will be
4 2 5 1 9 2 9 0 2
and
4 7 4 0 8 6 2 6 5
(generated by rand())

So we take two sublists : 
4 2 5 and 4 7 4

Let's sort them using quicksort
4 2 5
i   j <- two pointers
pivot = [0] = 4

increase i until it's greater than pivot = 4
i and j now points to 5
decrease j until it's smaller than pivot = 4
j points to 2
i is greater than j, thus except for pivot, the list is sorted.
Swap [0] and [j] to finish sorting
2 4 5.

Except for pivot + i + j, no extra allocation was needed (For 3 elements it's hard to believe, see any quicksort applet to have a better understanding).
Thus, from an algorithmic point of view, pivot + i + j are constant and can be neglected (in fact you would do memory - algorithm size to have the real K).

Do the same for the second set
(4 4 7)

Now with two pointers to the lists initialized to the start of the sublists, compare the pointed values.
2 - 4
Increase first sublist's pointers
4 - 4
Equal -> a first common element, remove them from list [2 5] and [4 7]
5 - 4
Increase second pointer
5 - 7
Increase first pointer
End of this sublists.

Dump this to original lists.
Do this for any other sublists
[1 9 2] and [0 8 6] -> [1 2 9] [0 6 8] -> no intersection
[9 0 2] and [2 6 5] -> [0 9 E] [5 6 E] -> [2]

Now we have : 
[2 5 E 1 2 9 0 9 E] and [4 7 E 0 6 8 5 6 E] with E for empty elements.
Now compare subsets with other subsets (sorted) (excepted for already processed sets (same index))
[2 5 E] with [0 6 8] and [6 5 E] -> becomes [2 E E] and [0 6 8] [6 E E]  (intersection [5])
Then
[1 2 9] with [4 7 E] and [6 E E] -> [1 2 9] and [4 7 E] [6 E E] (no intersection)
Then
[0 9 E] with [4 7 E] and [0 6 8] -> [9 E E] and [4 7 E] [6 8 E] (intersection [0])
End : 
[2 E E 1 2 9 9 E E] [4 7 E 6 8 E 6 E E] with common elements [4 2 5 0]

take K/2 elements of M and K/2 elements of N.
Sort M-subset and N-subset.
Now the intersection is trivial.
Write the intersection, drop elements of the intersection, write back the left elements.
Continue with all other K/2 subparts of M and N.
You have now some common elements in a third file, and some partially sorted lists.
Now for each K / 2 (minus removed elements) of M and N lists, compare them to find intersection efficiently.

(You also can add extra rounds of merging of 2 M-subsets and N-subsets to speed up intersection).

Hurrah !


Example of execution:

Ok let's take 2 lists of 9 integers with values between 0 and 9.
These lists will be
4 2 5 1 9 2 9 0 2
and
4 7 4 0 8 6 2 6 5
(generated by rand())

So we take two sublists : 
4 2 5 and 4 7 4

Let's sort them using quicksort
4 2 5
i   j <- two pointers
pivot = [0] = 4

increase i until it's greater than pivot = 4
i and j now points to 5
decrease j until it's smaller than pivot = 4
j points to 2
i is greater than j, thus except for pivot, the list is sorted.
Swap [0] and [j] to finish sorting
2 4 5.

Except for pivot + i + j, no extra allocation was needed (For 3 elements it's hard to believe, see any quicksort applet to have a better understanding).
Thus, from an algorithmic point of view, pivot + i + j are constant and can be neglected (in fact you would do memory - algorithm size to have the real K).

Do the same for the second set
(4 4 7)

Now with two pointers to the lists initialized to the start of the sublists, compare the pointed values.
2 - 4
Increase first sublist's pointers
4 - 4
Equal -> a first common element, remove them from list [2 5] and [4 7]
5 - 4
Increase second pointer
5 - 7
Increase first pointer
End of this sublists.

Dump this to original lists.
Do this for any other sublists
[1 9 2] and [0 8 6] -> [1 2 9] [0 6 8] -> no intersection
[9 0 2] and [2 6 5] -> [0 9 E] [5 6 E] -> [2]

Now we have : 
[2 5 E 1 2 9 0 9 E] and [4 7 E 0 6 8 5 6 E] with E for empty elements.
Now compare subsets with other subsets (sorted) (excepted for already processed sets (same index))
[2 5 E] with [0 6 8] and [6 5 E] -> becomes [2 E E] and [0 6 8] [6 E E]  (intersection [5])
Then
[1 2 9] with [4 7 E] and [6 E E] -> [1 2 9] and [4 7 E] [6 E E] (no intersection)
Then
[0 9 E] with [4 7 E] and [0 6 8] -> [9 E E] and [4 7 E] [6 8 E] (intersection [0])
End : 
[2 E E 1 2 9 9 E E] [4 7 E 6 8 E 6 E E] with common elements [4 2 5 0]
凉宸 2024-09-18 06:22:56

我认为您可以使用位集来实现此目的。BitSet 每个整数仅消耗一位。
希望这有帮助。


BitSet b=new BitSet();//set 1 with {10,20,30}
BitSet c=new BitSet();//set 2 with {20,30,1,2}
b.set(10);
b.set(20);
b.set(30);
c.set(20);
c.set(30);
c.set(1);
c.set(2);
b.and(c);
System.out.println(b);//{20, 30}

I think you can use bit set for this purpose.BitSet only consumes one bit per integer.
Hope this helps.


BitSet b=new BitSet();//set 1 with {10,20,30}
BitSet c=new BitSet();//set 2 with {20,30,1,2}
b.set(10);
b.set(20);
b.set(30);
c.set(20);
c.set(30);
c.set(1);
c.set(2);
b.and(c);
System.out.println(b);//{20, 30}
说不完的你爱 2024-09-18 06:22:56

嵌套循环连接将占用最少的内存。对于文件 1 中的每一行,您加载文件 2 中的每一行并进行比较。然后我猜您在文件 1 中标记了命中。

根据文件的大小,首先对其中一个文件进行排序可能会更有效(这可以用最少的内存完成)。很大程度上取决于您必须使用的内存量。

A nested loop join will take minimal memory. For each row in file 1 you load each row in file 2 and compare it. Then I guess you mark the hits in file 1.

Depending on the sizes of the files it might be more efficient to sort one of the files first (which can be done with minimal memory). A lot depends on the amount of memory you have to play with.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文