保证唯一的代理键分配 - 非二分图的最大匹配

发布于 2024-07-13 22:39:56 字数 1460 浏览 13 评论 0原文

我正在维护一个数据仓库，其中包含有关必须合并的一类实体的多个数据源。每个源都有一个自然键，并且应该发生的是始终为每个自然键创建一个且唯一的代理键。如果来自一个源系统的具有特定自然键的一条记录与来自另一个源系统的具有不同自然键的另一条记录表示相同的实体，则相同的代理键将分配给两者。

换句话说，如果源系统 A 的自然键 ABC 代表与源系统 B 的自然键 DEF 相同的实体，我们将为两者分配相同的代理键。表格看起来像这样：

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
 1               ABC                     DEF

这就是计划。然而，这个系统已经投入生产一段时间了，代理键分配一团糟。有一天，源系统 A 会在源系统 B 知道之前给出自然密钥 ABC。 DW 为其分配了代理键 1。然后源系统B开始给出自然密钥DEF，它代表与源系统A的自然密钥ABC相同的事物。 DW 错误地给出了这个组合代理键 2。该表将如下所示：

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
 1               ABC                     NULL
 2               ABC                     DEF

所以仓库一团糟。还有比这更复杂的情况。我的清理时间很短，需要找出一组干净的自然键映射的代理键。

稍微谷歌搜索一下就会发现，这可以建模为非二分图中的匹配问题：

维基百科 - 匹配< /a>

MIT 18.433 组合优化 - 非二分匹配讲义

我需要一个易于理解的 Edmond 路径、树和花算法的实现（不是最佳执行）。我没有正式的数学或计算机背景，我所拥有的都是自学的，而且今晚我并没有沉浸在数学的头脑中。有人可以帮忙吗？如果有一个写得好的解释能够指导我实现，我将不胜感激。

编辑：

数学方法是最佳的，因为我们希望最大化全局适应度。贪婪方法（首先获取 A 的所有实例，然后是 B，然后是 C...）会将您绘制到局部最大值角。

无论如何，我把这件事推给了业务分析师手动完成（总共 2000 万）。我正在帮助他们评估全球比赛质量的功能。这是理想的选择，因为无论如何他们都是签字的人，所以我的背面被覆盖了。

不使用代理键不会改变匹配问题。仍然需要发现和维护 1:1 的自然键映射。代理键只是一个方便的锚点，仅此而已。

原文

I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.

In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
 1               ABC                     DEF

That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:

SURROGATE_KEY   SOURCE_A_NATURAL_KEY    SOURCE_B_NATURAL_KEY
 1               ABC                     NULL
 2               ABC                     DEF

So the warehouse is a mess. There's much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate key to natural key mappings.

A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:

Wikipedia - Matching

MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching

I need an easy to understand implementation (not optimally performing) of Edmond's paths, trees, and flowers algorithm. I don't have a formal math or CS background, and what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well written explanation that guides me to an implementation would be deeply appreciated.

EDIT:

A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local maxima corner.

In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.

Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

森林散布 2024-07-20 22:39:56

我觉得你处理这件事的方式是错误的；正如 cdonner 所说，还有其他方法可以重建关键结构，而无需经历这种混乱。特别是，您需要保证给定记录的自然键始终是唯一的（违反此条件会让您陷入困境！）。让 ABC 和 DEF 识别同一条记录是灾难性的，但最终是可以修复的。我什至不确定为什么你需要代理键；虽然它们确实有很多优点，但我会考虑采用纯关系型，并将它们从您的架构中剔除，就像 la Celko 那样；它可能会让你摆脱困境。但这是必须在查看整个架构后做出的决定。

为了解决您可能的解决方案，我拿出了 DB West 的图论导论，第二版，其中在第 144 页描述了开花算法。您需要一些数学背景，包括数学符号和图论，以遵循算法，但它足够简洁，我认为它可以提供帮助（如果您决定走这条路）。如果您需要解释，请首先查阅图论资源（维基百科、您当地的图书馆、谷歌等），或者询问您是否找不到所需的内容。

3.3.17。算法。（Edmonds 的 Blossom 算法 [1965a]---草图）。
输入。图形G、G中的匹配M、M code>-不饱和顶点u。
想法。探索M-从u开始的交替路径，记录每个顶点到达它的顶点，并在何时收缩花朵成立。维护类似于算法 3.2.1 中的集合 S 和 T，其中 S 由 u 和顶点组成沿着饱和边缘到达。到达不饱和顶点会产生增强。
初始化。 S = {u} 和 T = {}（空集）。
迭代。如果S没有未标记的顶点，则停止；没有来自 u 的 M 增强路径。否则，请在 S 中选择未标记的 v。要从 v 开始探索，请连续考虑 N(v) 中的每个 y，使得 y 不在 中>T.
如果 y 未被 m 饱和，则从 y 回溯（根据需要展开花朵）以报告 M< /code>-增强(u, y)-路径。
如果y在S中，那么就发现了一朵花。暂停对 v 的探索并收缩花朵，用 S 中的单个新顶点替换其在 S 和 T 中的顶点代码>. 在较小的图中从该顶点继续搜索。否则，y 会通过 M 与某个 w 相匹配。在 T 中包含 y（从 v 到达），并在 S 中包含 w （从 y 到达）。探索完 v 的所有此类邻居后，标记 v 并进行迭代。

此处描述的算法运行时间为 O(n^4)，其中 n 是顶点数。 West 给出了运行速度为 O(n^5/2) 或 O(n^1/2 m)（m 是边数）的版本的参考。如果你想要这些参考资料，或者对埃德蒙兹原始论文的引用，只要提出要求，我就会从索引中找出它们（这在这本书中很糟糕）。

I get the impression you're going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both ABC and DEF identify the same record is disastrous, but ultimately repairable. I'm not even sure why you need surrogate keys at all; while they do have many advantages, I'd give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that's a decision that would have to be made after looking at your whole schema.

To address your potential solution, I've pulled out my copy of D. B. West's Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You'll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it's sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you're not finding what you need.

3.3.17. Algorithm. (Edmonds' Blossom Algorithm [1965a]---sketch).
Input. A graph G, a matching M in G, an M-unsaturated vertex u.
Idea. Explore M-alternating paths from u, recording for each vertex the vertex from which it was reached, and contracting blossoms when found. Maintain sets S and T analogous to those in Algorithm 3.2.1, with S consisting of u and the vertices reached along saturated edges. Reaching an unsaturated vertex yields an augmentation.
Initialization. S = {u} and T = {} (empty set).
Iteration. If S has no unmarked vertex, stop; there is no M-augmenting path from u. Otherwise, select an unmarked v in S. To explore from v, successively consider each y in N(v) such that y is not in T.
If y is unsaturated by m, then trace back from y (expanding blossoms as needed) to report an M-augmenting (u, y)-path.
If y is in S, then a blossom has been found. Suspend the exploration of v and contract the blossom, replacing its vertices in S and T by a single new vertex in S. Continue the search from this vertex in the smaller graph.
Otherwise, y is matched to some w by M. Include y in T (reached from v), and include w in S (reached from y).
After exploring all such neighbors of v, mark v and iterate.

The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^5/2) or O(n^1/2 m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).

回复收藏 0 原文

迷荒 2024-07-20 22:39:56

我认为您最好建立一组规则，并使用一组以迭代方式强制执行每个规则的简单查询来攻击您的键映射表。也许我过于简单化了，因为你的例子很简单。

以下是规则示例 - 只有您可以决定应用哪些规则：

如果存在重复项，则使用最低（最旧）的代理键
使用具有最高（最新）代理键的行中的自然键
使用最旧的代理键中的自然键完整的映射行
使用每个自然键最近出现的情况
...？

一旦建立了规则，编写重建键映射的查询就很简单了。我不确定这怎么可能是一道数学问题？

回复收藏 0 原文

十级心震 2024-07-20 22:39:56

如果您正在寻找实现，Eppsteins PADS 库有一个匹配算法，这对于您的目的来说应该足够快，一般匹配算法位于 CardinalityMatching.py 中。实现中的注释解释了发生了什么。该库很容易使用，要在 Python 中提供图，您可以使用字典 G 表示图，这样 G[v] 给出顶点 v 的邻居列表（或集合）。

示例：

G = {1: [1], 2:[1,3], 3: [2,4], 4:[3]}

给出一个线图4 个顶点。

If you are looking for an implementation, Eppsteins PADS library has a matching algorithm, this should be fast enough for your purposes, the general matching algorithm is in CardinalityMatching.py. The comments in the implementation explain what is going on. The library is easy to use, to supply a graph in Python you can represent the graph using a dictionary G, such that G[v] gives a list (or set) of neighbors of the vertex v.

Example: