同步两个相关对象列表的标准算法是什么？

发布于 2024-08-15 22:22:17 字数 1240 浏览 4 评论 0原文

我很确定这一定是在某种教科书中（或更可能在所有教科书中），但我似乎使用了错误的关键字来搜索它......:(

我在编程时面临的一个重复任务是我正在处理来自不同来源的对象列表，我需要以某种方式保持同步，通常有某种“主列表”，例如由某些外部 API 返回，然后是我自己创建的每个对象对应的列表。主列表中的对象（认为“包装器”或“适配器” - 它们通常包含有关特定于我的应用程序的外部对象的扩展信息和/或简化对

问题的所有实例的访问）：

主列表的实现对我来说是隐藏的；它的接口是固定的
两个列表中的元素不兼容分配
我可以完全控制从列表的实现
我无法控制主列表中元素的顺序（即它不可排序）
主列表要么根本不提供有关添加或删除元素的通知，要么通知不可靠，即同步只能按需发生，而不是
实时，只要需要就可以简单地从头开始清除和重建从属列表。一个选项：
- 初始化包装对象应该被认为是昂贵的
- 其他对象将保存对包装器的引用

在问题的某些实例中的附加特征：

主列表中的元素只能通过读取其属性来识别，而不是通过索引或直接访问它们内存地址：
- 刷新后，主列表可能会返回一组全新的实例，即使它们仍然表示相同的信息
- 访问主列表中元素的唯一接口可能是顺序枚举器
大多数情况下，，主列表中的元素是稳定的，即新元素总是添加在开头或结尾，而不是中间；然而，删除通常可以发生在任何位置

那么我通常会如何解决这个问题？我应该用谷歌搜索的算法的名称是什么？

过去，我以各种方式实现了这一点（参见下面的示例），但总觉得应该有一种更干净、更有效的方法，尤其是不需要两次迭代的方法（每个列表一次）。

下面是一个示例方法：

迭代主列表
查找“从列表”中的每个项目
添加尚不存在的项目
以某种方式跟踪两个列表中已存在的项目（例如，通过标记它们或保留另一个列表）
当完成后，迭代从属列表并删除所有未标记的对象（参见 4.），并再次从所有其他对象中清除标记

更新 1 感谢您迄今为止的所有回复！我需要一些时间来查看链接。
[...] （文本移至问题主体）

更新 2 将中间段落重组为（希望）更容易解析的项目符号列表，并合并了第一次更新中稍后添加的详细信息。

原文

I'm pretty sure this must be in some kind of text book (or more likely in all of them) but I seem to be using the wrong keywords to search for it... :(

A recurring task I'm facing while programming is that I am dealing with lists of objects from different sources which I need to keep in sync somehow. Typically there's some sort of "master list" e.g. returned by some external API and then a list of objects I create myself each of which corresponds to an object in the master list (think "wrappers" or "adapters" - they typically contain extended information about the external objects specific to my application and/or they simplify access to the external objects).

Hard characteristics of all instances of the problem:

the implementation of the master list is hidden from me; its interface is fixed
the elements in the two lists are not assignment-compatible
I have full control over the implementation of the slave list
I cannot control the order of elements in the master list (i.e. it's not sortable)
the master list does either not provide notification about added or removed elements at all or notification is unreliable, i.e. the sync can only happen on-demand, not live
simply clearing and rebuilding the slave list from scratch whenever it's needed is not an option:
- initializing the wrapper objects should be considered expensive
- other objects will hold references to the wrappers

Additional characteristics in some instances of the problem:

elements in the master list can only be identified by reading their properties rather than accessing them directly by index or memory address:
- after a refresh, the master list might return a completely new set of instances even though they still represent the same information
- the only interface for accessing elements in the master list might be a sequential enumerator
most of the time, the order of elements in the master list is stable, i.e. new elements are always added either at the beginning or at the end, never in the middle; however, deletion can usually occur at any position

So how would I typically tackle this? What's the name of the algorithm I should google for?

In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way, especially one that did not require two iterations (one over each list).

Here's an example approach:

Iterate over the master list
Look up each item in the "slave list"
Add items that do not yet exist
Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list)
When done, iterate over the slave list and remove all objects that have not been tagged (see 4.) and clear the tag again from all others

Update 1
Thanks for all your responses so far! I will need some time to look at the links.
[...] (text moved to main body of question)

Update 2
Restructered the middle-paragraph into a (hopefully) more easily parseable bullet lists and incorporated details added later in the first update.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

叫嚣ゝ 2024-08-22 22:22:17

2种典型的解决方案是：
1. 将主列表复制到同步列表。
2. 在所有元素对之间进行 O(N*N) 比较。

您已经排除了智能选项：共享身份、排序和更改通知。

请注意，列表是否可以以有意义的方式排序，甚至完全排序，都无关紧要。例如，当比较两个字符串列表时，最好按字母顺序排序。但如果您按字符串长度对两个列表进行排序，列表比较仍然会更有效！您仍然需要对相同长度的字符串进行完整的成对比较，但这可能会少得多。

回复收藏 0 原文

厌味 2024-08-22 22:22:17

这看起来像集合协调问题，即同步无序数据的问题。有人就此提出了有关SO的问题：设置协调算法的实现。

谷歌上的大部分参考文献都是技术论文摘要。

回复收藏 0 原文

演出会有结束 2024-08-22 22:22:17

通常，解决此类问题的最佳方法是不直接解决它们。

如果您确实无法在代码部分使用排序的二进制可搜索容器（例如集合甚至排序的向量），那么...

您的内存非常有限吗？如果没有，那么我只需创建一个包含其中一个列表内容的字典（例如 std::set），然后迭代第二个列表，我希望与第一个列表同步。

这样，您可以执行 nlogn 来创建字典（或者 nX 来创建哈希字典，具体取决于哪一个更有效）+ mlogn 操作来遍历第二个列表并同步它（或者只是 MY） - 如果你真的必须首先使用列表，那么很难被击败 - 如果你需要它，你只做一次也很好，这比保持列表排序要好得多一直以来，这对他们俩来说都是一项 ^2 任务。

回复收藏 0 原文

菩提树下叶撕阳。 2024-08-22 22:22:17

看起来一个名叫 Michael Heyeck 的人有一个很好的O(n) 解决方案这个问题。查看该博客文章以获取解释和一些代码。

本质上，该解决方案一次性跟踪主列表和从列表，并跟踪每个列表的索引。然后管理两个数据结构：要在从属列表上重播的插入列表和删除列表。

它看起来很简单，而且还具有极简主义证明的优点，海耶克在后续文章中跟进了这一点发布。这篇文章中的代码片段也更加紧凑：

def sync_ordered_list(a, b):
x = 0; y = 0; i = []; d = []
while (x < len(a)) or (y < len(b)):
    if y >= len(b): d.append(x); x += 1
    elif x >= len(a): i.append((y, b[y])); y += 1
    elif a[x] < b[y]: d.append(x); x += 1
    elif a[x] > b[y]: i.append((y, b[y])); y += 1
    else: x += 1; y += 1
return (i,d)

再次感谢 Michael Heyeck。

It looks like a fellow named Michael Heyeck has a good, O(n) solution to this problem. Check out that blog post for an explanation and some code.

Essentially, the solution tracks both the master and slave lists in a single pass, tracking indices into each. Two data structures are then managed: a list of insertions to be replayed on the slave list, and a list of deletions.

It looks straightforward and also has the benefit of a proof of minimalism, which Heyeck followed up with in a subsequent post. The code snippet in this post is more compact, as well:

def sync_ordered_list(a, b):
x = 0; y = 0; i = []; d = []
while (x < len(a)) or (y < len(b)):
    if y >= len(b): d.append(x); x += 1
    elif x >= len(a): i.append((y, b[y])); y += 1
    elif a[x] < b[y]: d.append(x); x += 1
    elif a[x] > b[y]: i.append((y, b[y])); y += 1
    else: x += 1; y += 1
return (i,d)

Again, credit to Michael Heyeck.

回复收藏 0 原文

不弃不离 2024-08-22 22:22:17

在 C++ STL 中，该算法称为 set_union。此外，如果您将并集合并到第三个列表中，则实现该算法可能会简单得多。

回复收藏 0 原文

风吹雪碎 2024-08-22 22:22:17

我过去在一个项目中遇到过这样的问题。

该项目有一个主数据源和多个独立更新数据的客户端，最终所有这些客户端都必须拥有最新且统一的数据集，即它们的总和。

我所做的是构建类似于 SVN 协议的东西，每次我想要更新主数据库（可通过 Web 服务访问）时，我都会获得修订号。将我的本地数据存储更新到该修订版，然后将任何修订版号未涵盖的实体提交到数据库。

每个客户端都能够将其本地数据存储更新到最新版本。

回复收藏 0 原文

浴红衣 2024-08-22 22:22:17

这是 Michael Heyek 的 Python 代码的 Javascript 版本。

    var b= [1,3,8,12,16,19,22,24,26]; // new situation
    var a = [1,2,8,9,19,22,23,26]; // previous situation
    var result = sync_ordered_lists(a,b);
console.log(result);
    function sync_ordered_lists(a,b){
// by Michael Heyeck see http://www.mlsite.net/blog/?p=2250
// a is the subject list
// b is the target list
// x is the "current position" in the subject list
// y is the "current position" in the target list
// i is the list of inserts
// d is the list of deletes
        var x = 0; 
        var y = 0; 
        var i = []; 
        var d = []; 
        var acc = {}; // object containing inserts and deletes arrays
        while (x < a.length || y < b.length) {
            if (y >= b.length){
                d.push(x); 
                x++;
            } else if (x >= a.length){ 
                i.push([y, b[y]]); 
                y++;
            } else if (a[x] < b[y]){ 
                d.push(x); 
                x++;
            } else if (a[x] > b[y]){ 
                i.push([y, b[y]]); 
                y++;
            } else { 
                x++; y++;
            }
        }
        acc.inserts = i;
        acc.deletes = d;
        return acc;
    }

Here is a Javascript version of Michael Heyek's python code.

    var b= [1,3,8,12,16,19,22,24,26]; // new situation
    var a = [1,2,8,9,19,22,23,26]; // previous situation
    var result = sync_ordered_lists(a,b);
console.log(result);
    function sync_ordered_lists(a,b){
// by Michael Heyeck see http://www.mlsite.net/blog/?p=2250
// a is the subject list
// b is the target list
// x is the "current position" in the subject list
// y is the "current position" in the target list
// i is the list of inserts
// d is the list of deletes
        var x = 0; 
        var y = 0; 
        var i = []; 
        var d = []; 
        var acc = {}; // object containing inserts and deletes arrays
        while (x < a.length || y < b.length) {
            if (y >= b.length){
                d.push(x); 
                x++;
            } else if (x >= a.length){ 
                i.push([y, b[y]]); 
                y++;
            } else if (a[x] < b[y]){ 
                d.push(x); 
                x++;
            } else if (a[x] > b[y]){ 
                i.push([y, b[y]]); 
                y++;
            } else { 
                x++; y++;
            }
        }
        acc.inserts = i;
        acc.deletes = d;
        return acc;
    }

回复收藏 0 原文