How to determine the differences between two lists of data
This is an exercise for the CS guys to shine with the theory.
Imagine you have 2 containers with elements. Folders, URLs, Files, Strings, it really doesn't matter.
What is an algorithm to calculate which elements were added and which were removed?
Notice: If there are many ways to solve this problem, please post one per answer so it can be analysed and voted up.
Edit: All the answers solve the matter with 4 containers. Is it possible to use only the initial 2?
Comments (5)
Assuming you have two lists of unique items, and the ordering doesn't matter, you can think of them both as sets rather than lists.
If you think of a Venn diagram, with list A as one circle and list B as the other, then the intersection of the two is the constant pool.
Remove all the elements in this intersection from both A and B, and anything left in A has been deleted, whilst anything left in B has been added.
So, iterate through A looking for each item in B. If you find it, remove it from both A and B
Then A is a list of things that were deleted, and B is a list of things that were added
I think...
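A minimal Python sketch of that idea, assuming A and B are sets of hashable items. The function name `strip_common` is made up for illustration, and the loop iterates over a temporary snapshot of A so that A can be modified safely while looping:

```python
def strip_common(a: set, b: set) -> None:
    """Remove every element common to a and b from both sets, in place."""
    # Iterate over a snapshot, because a set must not change size
    # while it is being iterated directly.
    for item in list(a):
        if item in b:
            a.remove(item)
            b.remove(item)


old = {"x", "y", "z"}
new = {"y", "z", "w"}
strip_common(old, new)
# old now holds the deleted items, new holds the added items
```

Both inputs are destroyed by this, which matches the description above: what remains in A was deleted and what remains in B was added.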
[edit] Ok, with the new "only 2 container" restriction, the same still holds:
Then you aren't constructing a new list or destroying your old ones, but it will take longer: in the previous example you could just loop over the shorter list and remove its elements from the longer one, whereas here you need to loop over both lists.
And I'd argue my first solution didn't use 4 containers, it just destroyed two ;-)
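One way to read the "only 2 containers" variant is sketched below in Python. It assumes the results may be reported one at a time instead of being stored; the generator name `changes` and the `"removed"`/`"added"` labels are illustrative choices, not part of the original answer:

```python
from typing import Iterator, Sequence, Tuple


def changes(a: Sequence, b: Sequence) -> Iterator[Tuple[str, object]]:
    """Yield ("removed", item) / ("added", item) without building new lists."""
    # Anything in A that is missing from B has been removed.
    for item in a:
        if item not in b:
            yield ("removed", item)
    # Anything in B that is missing from A has been added.
    for item in b:
        if item not in a:
            yield ("added", item)


for kind, item in changes(["x", "y", "z"], ["y", "z", "w"]):
    print(kind, item)
```

Neither input is modified and no result container is built, but each `in` test on a plain list is a linear scan, so this does the work of both loops at roughly O(len(a) * len(b)) cost.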
I have not done this in a while but I believe the algorithm goes like this...
With regard to how right-list relates to left-list, deletes contains the items that were removed and adds contains the new items.
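As a rough illustration of how `adds` and `deletes` could be produced, here is a Python sketch that walks two sorted copies in step. The two-pointer walk and the function name `sorted_diff` are assumptions made for this example, not necessarily the algorithm the answer had in mind, and the items are assumed to be unique and mutually comparable:

```python
def sorted_diff(left_list, right_list):
    """Return (adds, deletes) describing right_list relative to left_list."""
    left, right = sorted(left_list), sorted(right_list)
    adds, deletes = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Present on both sides: unchanged.
            i += 1
            j += 1
        elif left[i] < right[j]:
            # Only on the left: it was removed.
            deletes.append(left[i])
            i += 1
        else:
            # Only on the right: it is new.
            adds.append(right[j])
            j += 1
    # Whatever is left over on either side has no counterpart.
    deletes.extend(left[i:])
    adds.extend(right[j:])
    return adds, deletes
```

After the two sorts, the walk itself is a single linear pass over both lists.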
Missing information: How do you define added/removed? E.g. suppose the lists (A and B) show the same directory on Server A and Server B, and the two are in sync. If I now wait for 10 days, generate the lists again and compare them, how can I tell if something has been removed? I cannot. I can only tell there are files on Server A not found on Server B and/or the other way round. Whether that is because a file has been added to Server A (thus the file is not found on B) or a file has been deleted on Server B (thus the file is not found on B anymore) is something I cannot determine from a list of file names alone.
For the solution I suggest, I will just assume that you have one list named OLD and one list named NEW. Everything found on OLD but not on NEW has been removed. Everything found on NEW, but not on OLD has been added (e.g. the content of the same directory on the same server, however lists have been created at different dates).
Further I will assume there are no duplicates. That means every item on either list is unique in the sense of: If I compare this item to any other item on the list (no matter how this compare works), I can always say the item is either smaller or bigger than the one I'm comparing it to, but never equal. E.g. when dealing with strings, I can compare them lexicographically and the same string is never twice in the list.
In that case the simplest solution (though not necessarily the best) is:
Sort the OLD list. E.g. if the list consists of strings, sort them alphabetically. Sorting is necessary because it means I can use binary search to quickly find an object in the list, assuming it exists there (or to quickly determine that it does not exist in the list at all). If the list is unsorted, finding the object has a complexity of O(n) (I need to look at every single item on the list). If the list is sorted, complexity is only O(log n), as after every attempt to match an item I can exclude half of the remaining items as not being a match. Even if the list has 100 items, finding an item (or detecting that the item is not on the list) takes at most 7 tests (2^7 = 128 > 100), far fewer than 100. The NEW list doesn't have to be sorted.
Now we perform list elimination. For every item on the NEW list, try to find this item on the OLD list (using binary search). If the item is found, remove it from the OLD list and also remove it from the NEW list. This also means the lists get smaller the further the elimination progresses, and thus the lookups become faster and faster. Since removing an item from a list has no effect on the sort order of the remaining items, there is never any need to re-sort the OLD list during the elimination phase.
At the end of elimination, both lists might be empty, in which case they were equal. If they are not empty, all items still on the OLD list are items missing from the NEW list (otherwise we would have removed them), hence these are the removed items. All items still on the NEW list are items that were not on the OLD list (again, we would have removed them otherwise), hence these are the added items.
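The sort-then-eliminate steps above can be sketched in Python roughly as follows, using the standard `bisect` module for the binary search. The function name `eliminate` is made up for this example, and the items are assumed to be unique and mutually comparable:

```python
from bisect import bisect_left


def eliminate(old: list, new: list) -> None:
    """Reduce old to the removed items and new to the added items, in place."""
    old.sort()  # only OLD needs to be sorted
    k = 0  # current position in NEW
    while k < len(new):
        item = new[k]
        pos = bisect_left(old, item)  # binary search in the sorted OLD list
        if pos < len(old) and old[pos] == item:
            # Found in both lists: eliminate it from each.
            del old[pos]
            del new[k]  # do not advance k, the next item slid into place
        else:
            k += 1  # not in OLD, so it stays in NEW as an added item


old = ["c.txt", "a.txt", "b.txt"]
new = ["b.txt", "d.txt", "c.txt"]
eliminate(old, new)
# old holds the removed items, new holds the added items
```

One caveat of this particular sketch: deleting from the middle of a Python list shifts the following elements, so although each lookup needs only O(log n) comparisons, the deletions themselves are linear in the worst case.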
What Joe said. And, if the lists are too large to fit into memory, use an external file sorting utility or a Merge sort.
Are the objects in the list "unique"? In this case I would first build two maps (hashmaps) and then scan the lists and lookup every object in the maps.
Sorry for the horrible meta-language mixing Ruby and Java :-P
In the end removedElements will contain the elements belonging to list1 but not to list2, and addedElements will contain the elements belonging to list2 but not to list1.
The cost of the whole operation is O(4*N), i.e. O(N), since a lookup in the map/dictionary may be considered constant. By contrast, linearly searching the lists for each element makes it O(N^2), and binary searching sorted lists makes it O(N log N).
EDIT: on second thought, by moving the last check into the second loop you can remove one of the loops... but that's ugly... :)
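A small Python sketch of the hash-based approach, assuming the elements are hashable and unique within each list. Python sets stand in for the two hash maps, and the names `hash_diff`, `removed_elements` and `added_elements` are chosen for this example:

```python
def hash_diff(list1, list2):
    """Return (added_elements, removed_elements) using hash lookups."""
    # Build one hash-based lookup structure per list.
    in_list1 = set(list1)
    in_list2 = set(list2)
    # Scan each list once and look every element up in the other structure.
    removed_elements = [x for x in list1 if x not in in_list2]
    added_elements = [x for x in list2 if x not in in_list1]
    return added_elements, removed_elements


added, removed = hash_diff(["a", "b", "c"], ["b", "c", "d"])
print(added, removed)
```

This makes four linear passes in total (two to build the sets, two to scan the lists), each lookup being constant time on average.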