保证唯一的代理键分配 - 非二分图的最大匹配
我正在维护一个数据仓库,其中包含有关必须合并的一类实体的多个数据源。 每个源都有一个自然键,并且应该发生的是始终为每个自然键创建一个且唯一的代理键。 如果来自一个源系统的具有特定自然键的一条记录与来自另一个源系统的具有不同自然键的另一条记录表示相同的实体,则相同的代理键将分配给两者。
换句话说,如果源系统 A 的自然键 ABC 代表与源系统 B 的自然键 DEF 相同的实体,我们将为两者分配相同的代理键。 表格看起来像这样:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC DEF
这就是计划。 然而,这个系统已经投入生产一段时间了,代理键分配一团糟。 有一天,源系统 A 会在源系统 B 知道之前给出自然密钥 ABC。 DW 为其分配了代理键 1。 然后源系统B开始给出自然密钥DEF,它代表与源系统A的自然密钥ABC相同的事物。 DW 错误地给出了这个组合代理键 2。该表将如下所示:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC NULL
2 ABC DEF
所以仓库一团糟。 还有比这更复杂的情况。 我的清理时间很短,需要找出一组干净的自然键映射的代理键。
稍微谷歌搜索一下就会发现,这可以建模为非二分图中的匹配问题:
我需要一个易于理解的 Edmond 路径、树和花算法的实现(不是最佳执行)。 我没有正式的数学或计算机背景,我所拥有的都是自学的,而且今晚我并没有沉浸在数学的头脑中。 有人可以帮忙吗? 如果有一个写得好的解释能够指导我实现,我将不胜感激。
编辑:
数学方法是最佳的,因为我们希望最大化全局适应度。 贪婪方法(首先获取 A 的所有实例,然后是 B,然后是 C...)会将您绘制到局部最大值角。
无论如何,我把这件事推给了业务分析师手动完成(总共 2000 万)。 我正在帮助他们评估全球比赛质量的功能。 这是理想的选择,因为无论如何他们都是签字的人,所以我的背面被覆盖了。
不使用代理键不会改变匹配问题。 仍然需要发现和维护 1:1 的自然键映射。 代理键只是一个方便的锚点,仅此而已。
I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.
In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC DEF
That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:
SURROGATE_KEY SOURCE_A_NATURAL_KEY SOURCE_B_NATURAL_KEY
1 ABC NULL
2 ABC DEF
So the warehouse is a mess. There's much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate key to natural key mappings.
A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:
MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching
I need an easy to understand implementation (not optimally performing) of Edmond's paths, trees, and flowers algorithm. I don't have a formal math or CS background, and what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well written explanation that guides me to an implementation would be deeply appreciated.
EDIT:
A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local maxima corner.
In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.
Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我觉得你处理这件事的方式是错误的; 正如 cdonner 所说,还有其他方法可以重建关键结构,而无需经历这种混乱。 特别是,您需要保证给定记录的自然键始终是唯一的(违反此条件会让您陷入困境!)。 让
ABC
和DEF
识别同一条记录是灾难性的,但最终是可以修复的。 我什至不确定为什么你需要代理键; 虽然它们确实有很多优点,但我会考虑采用纯关系型,并将它们从您的架构中剔除,就像 la Celko 那样; 它可能会让你摆脱困境。 但这是必须在查看整个架构后做出的决定。为了解决您可能的解决方案,我拿出了 DB West 的图论导论,第二版,其中在第 144 页描述了开花算法。您需要一些数学背景,包括数学符号和图论,以遵循算法,但它足够简洁,我认为它可以提供帮助(如果您决定走这条路)。 如果您需要解释,请首先查阅图论资源(维基百科、您当地的图书馆、谷歌等),或者询问您是否找不到所需的内容。
此处描述的算法运行时间为 O(n^4),其中 n 是顶点数。 West 给出了运行速度为 O(n^5/2) 或 O(n^1/2 m)(m 是边数)的版本的参考。 如果你想要这些参考资料,或者对埃德蒙兹原始论文的引用,只要提出要求,我就会从索引中找出它们(这在这本书中很糟糕)。
I get the impression you're going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both
ABC
andDEF
identify the same record is disastrous, but ultimately repairable. I'm not even sure why you need surrogate keys at all; while they do have many advantages, I'd give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that's a decision that would have to be made after looking at your whole schema.To address your potential solution, I've pulled out my copy of D. B. West's Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You'll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it's sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you're not finding what you need.
The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^5/2) or O(n^1/2 m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).
我认为您最好建立一组规则,并使用一组以迭代方式强制执行每个规则的简单查询来攻击您的键映射表。 也许我过于简单化了,因为你的例子很简单。
以下是规则示例 - 只有您可以决定应用哪些规则:
一旦建立了规则,编写重建键映射的查询就很简单了。 我不确定这怎么可能是一道数学问题?
I think you would be better off by establishing a set of rules and attacking your key mapping table with a set of simple queries that enforce each rule, in an iterative fashion. Maybe I am oversimplifying because your example is simple.
The following are examples of rules - only you can decide which ones apply:
Writing queries that rebuild your key mapping is trivial, once you have established the rules. I am not sure how this could be a math problem?
如果您正在寻找实现,Eppsteins PADS 库有一个匹配算法,这对于您的目的来说应该足够快,一般匹配算法位于 CardinalityMatching.py 中。 实现中的注释解释了发生了什么。 该库很容易使用,要在 Python 中提供图,您可以使用字典 G 表示图,这样 G[v] 给出顶点 v 的邻居列表(或集合)。
示例:
给出一个线图4 个顶点。
If you are looking for an implementation, Eppsteins PADS library has a matching algorithm, this should be fast enough for your purposes, the general matching algorithm is in CardinalityMatching.py. The comments in the implementation explain what is going on. The library is easy to use, to supply a graph in Python you can represent the graph using a dictionary G, such that G[v] gives a list (or set) of neighbors of the vertex v.
Example:
gives a line graph with 4 vertices.