同步两个相关对象列表的标准算法是什么?
我很确定这一定是在某种教科书中(或更可能在所有教科书中),但我似乎使用了错误的关键字来搜索它......:(
我在编程时面临的一个重复任务是我正在处理来自不同来源的对象列表,我需要以某种方式保持同步,通常有某种“主列表”,例如由某些外部 API 返回,然后是我自己创建的每个对象对应的列表。主列表中的对象(认为“包装器”或“适配器” - 它们通常包含有关特定于我的应用程序的外部对象的扩展信息和/或简化对
问题的所有实例的访问):
- 主列表的实现对我来说是隐藏的;它的接口是固定的
- 两个列表中的元素不兼容分配
- 我可以完全控制从列表的实现
- 我无法控制主列表中元素的顺序(即它不可排序)
- 主列表要么根本不提供有关添加或删除元素的通知,要么通知不可靠,即同步只能按需发生,而不是
- 实时,只要需要就可以简单地从头开始清除和重建从属列表。一个选项:
- 初始化包装对象应该被认为是昂贵的
- 其他对象将保存对包装器的引用
在问题的某些实例中的附加特征:
- 主列表中的元素只能通过读取其属性来识别,而不是通过索引或直接访问它们内存地址:
- 刷新后,主列表可能会返回一组全新的实例,即使它们仍然表示相同的信息
- 访问主列表中元素的唯一接口可能是顺序枚举器
- 大多数情况下, ,主列表中的元素是稳定的,即新元素总是添加在开头或结尾,而不是中间;然而,删除通常可以发生在任何位置
那么我通常会如何解决这个问题?我应该用谷歌搜索的算法的名称是什么?
过去,我以各种方式实现了这一点(参见下面的示例),但总觉得应该有一种更干净、更有效的方法,尤其是不需要两次迭代的方法(每个列表一次)。
下面是一个示例方法:
- 迭代主列表
- 查找“从列表”中的每个项目
- 添加尚不存在的项目
- 以某种方式跟踪两个列表中已存在的项目(例如,通过标记它们或保留另一个列表)
- 当完成后,迭代从属列表并删除所有未标记的对象(参见 4.),并再次从所有其他对象中清除标记
更新 1 感谢您迄今为止的所有回复!我需要一些时间来查看链接。
[...] (文本移至问题主体)
更新 2 将中间段落重组为(希望)更容易解析的项目符号列表,并合并了第一次更新中稍后添加的详细信息。
I'm pretty sure this must be in some kind of text book (or more likely in all of them) but I seem to be using the wrong keywords to search for it... :(
A recurring task I'm facing while programming is that I am dealing with lists of objects from different sources which I need to keep in sync somehow. Typically there's some sort of "master list" e.g. returned by some external API and then a list of objects I create myself each of which corresponds to an object in the master list (think "wrappers" or "adapters" - they typically contain extended information about the external objects specific to my application and/or they simplify access to the external objects).
Hard characteristics of all instances of the problem:
- the implementation of the master list is hidden from me; its interface is fixed
- the elements in the two lists are not assignment-compatible
- I have full control over the implementation of the slave list
- I cannot control the order of elements in the master list (i.e. it's not sortable)
- the master list does either not provide notification about added or removed elements at all or notification is unreliable, i.e. the sync can only happen on-demand, not live
- simply clearing and rebuilding the slave list from scratch whenever it's needed is not an option:
- initializing the wrapper objects should be considered expensive
- other objects will hold references to the wrappers
Additional characteristics in some instances of the problem:
- elements in the master list can only be identified by reading their properties rather than accessing them directly by index or memory address:
- after a refresh, the master list might return a completely new set of instances even though they still represent the same information
- the only interface for accessing elements in the master list might be a sequential enumerator
- most of the time, the order of elements in the master list is stable, i.e. new elements are always added either at the beginning or at the end, never in the middle; however, deletion can usually occur at any position
So how would I typically tackle this? What's the name of the algorithm I should google for?
In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way, especially one that did not require two iterations (one over each list).
Here's an example approach:
- Iterate over the master list
- Look up each item in the "slave list"
- Add items that do not yet exist
- Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list)
- When done, iterate over the slave list and remove all objects that have not been tagged (see 4.) and clear the tag again from all others
Update 1
Thanks for all your responses so far! I will need some time to look at the links.
[...] (text moved to main body of question)
Update 2
Restructered the middle-paragraph into a (hopefully) more easily parseable bullet lists and incorporated details added later in the first update.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
2种典型的解决方案是:
1. 将主列表复制到同步列表。
2. 在所有元素对之间进行 O(N*N) 比较。
您已经排除了智能选项:共享身份、排序和更改通知。
请注意,列表是否可以以有意义的方式排序,甚至完全排序,都无关紧要。例如,当比较两个字符串列表时,最好按字母顺序排序。但如果您按字符串长度对两个列表进行排序,列表比较仍然会更有效!您仍然需要对相同长度的字符串进行完整的成对比较,但这可能会少得多。
The 2 typical solutions are:
1. Copy the master list to the sync list.
2. Do an O(N*N) comparison between all element pairs.
You've excluded the smart options already: shared identity, sorting and change notifications.
Note that it's not relevant whether the lists can be sorted in a meaningful way, or even completely. For instance, when comparing two string lists, it would be ideal to sort alphabetically. But the list comparison would still be more efficient if you'd sort both lists by string length! You'd still have to do a full pairwise comparison of strings of the same length, but that will probably be a much smaller nummber of pairs.
这看起来像集合协调问题,即同步无序数据的问题。有人就此提出了有关SO的问题:设置协调算法的实现。
谷歌上的大部分参考文献都是技术论文摘要。
This looks like the set reconciliation problem i.e. the problem of synchronizing unordered data. A question on SO was asked on this: Implementation of set reconciliation algorithm.
Most of the references on google are to technical paper abstracts.
通常,解决此类问题的最佳方法是不直接解决它们。
如果您确实无法在代码部分使用排序的二进制可搜索容器(例如集合甚至排序的向量),那么...
您的内存非常有限吗?如果没有,那么我只需创建一个包含其中一个列表内容的字典(例如 std::set),然后迭代第二个列表,我希望与第一个列表同步。
这样,您可以执行 nlogn 来创建字典(或者 nX 来创建哈希字典,具体取决于哪一个更有效)+ mlogn 操作来遍历第二个列表并同步它(或者只是 MY) - 如果你真的必须首先使用列表,那么很难被击败 - 如果你需要它,你只做一次也很好,这比保持列表排序要好得多一直以来,这对他们俩来说都是一项 ^2 任务。
Often the best solution to such problems is to not solve them directly.
IF you really can't use a sorted binary searchable container in your part of the code (like a set or even a sorted vector) then...
Are you very memory bound? If not then I'd just create a dictionary (an std::set for example) containing the contents of one of the lists and then just iterate over the second which I want o sync with the first.
This way you're doing nlogn to create the dictionary (or nX for a hash dictionary depending on which will be more efficient) + mlogn operations to go over the second list and sync it (or just MY) - hard to beat if you really have to use lists in the first place - it's also good you do it only once when and if you need it and it's way better then keeping the lists sorted all the time which would be a n^2 task for both of them.
看起来一个名叫 Michael Heyeck 的人有一个很好的O(n) 解决方案这个问题。查看该博客文章以获取解释和一些代码。
本质上,该解决方案一次性跟踪主列表和从列表,并跟踪每个列表的索引。然后管理两个数据结构:要在从属列表上重播的插入列表和删除列表。
它看起来很简单,而且还具有极简主义证明的优点,海耶克在 后续文章中跟进了这一点发布。这篇文章中的代码片段也更加紧凑:
再次感谢 Michael Heyeck。
It looks like a fellow named Michael Heyeck has a good, O(n) solution to this problem. Check out that blog post for an explanation and some code.
Essentially, the solution tracks both the master and slave lists in a single pass, tracking indices into each. Two data structures are then managed: a list of insertions to be replayed on the slave list, and a list of deletions.
It looks straightforward and also has the benefit of a proof of minimalism, which Heyeck followed up with in a subsequent post. The code snippet in this post is more compact, as well:
Again, credit to Michael Heyeck.
在 C++ STL 中,该算法称为 set_union。此外,如果您将并集合并到第三个列表中,则实现该算法可能会简单得多。
In the C++ STL the algorithm is called set_union. Also, implementing the algorithm is likely to be a lot simpler if you do the union into a 3rd list.
我过去在一个项目中遇到过这样的问题。
该项目有一个主数据源和多个独立更新数据的客户端,最终所有这些客户端都必须拥有最新且统一的数据集,即它们的总和。
我所做的是构建类似于 SVN 协议的东西,每次我想要更新主数据库(可通过 Web 服务访问)时,我都会获得修订号。将我的本地数据存储更新到该修订版,然后将任何修订版号未涵盖的实体提交到数据库。
每个客户端都能够将其本地数据存储更新到最新版本。
I had such problem in one project in the past.
That project had one master data source and several clients that update the data independently and in the end all of them have to have the latest and unified set of data that is the sum of them.
What I did was building something similar to the SVN protocol, in which every time I wanted to update the master database (which was accessible through a web service) I got the revision number. Updated my local data store to that revision and then commited the entities that aren't covered by any revision number to the database.
Every client has the ability to update it's local data store to the latest revision.
这是 Michael Heyek 的 Python 代码的 Javascript 版本。
Here is a Javascript version of Michael Heyek's python code.
非常暴力和纯粹的技术方法:
从您的 List 类继承(抱歉不知道您的语言是什么)。重写子列表类中的添加/删除方法。使用您的类而不是基础类。现在您可以使用自己的方法跟踪更改并在线同步列表。
Very brute-force and pure technical approach:
Inherit from your List class (sorry don't know what is your language). Override add/remove methods in your child list class. Use your class instead of the base one. Now you can track changes with your own methods and synchronize lists on-line.