Family tree algorithm
I'm working on putting together a problem set for an intro-level CS course and came up with a question that, on the surface, seems very simple:
You are given a list of people with the names of their parents, their birth dates, and their death dates. You are interested in finding out who, at some point in their lifetime, was a parent, a grandparent, a great-grandparent, etc. Devise an algorithm to label each person with this information as an integer (0 means the person never had a child, 1 means that the person was a parent, 2 means that the person was a grandparent, etc.)
For simplicity, you can assume that the family graph is a DAG whose undirected version is a tree.
The interesting challenge here is that you can't just look at the shape of the tree to determine this information. For example, I have 8 great-great-grandparents, but since none of them were alive when I was born, in their lifetimes none of them were great-great-grandparents.
The best algorithm I can come up with for this problem runs in time O(n²), where n is the number of people. The idea is simple: start a DFS from each person, finding the furthest descendant down in the family tree that was born before that person's death date. However, I'm pretty sure that this is not the optimal solution to the problem. For example, if the graph is just two parents and their n children, then the problem can be solved trivially in O(n). What I'm hoping for is some algorithm that either beats O(n²) or whose runtime is parameterized over the shape of the graph, making it fast for wide graphs and degrading gracefully to O(n²) in the worst case.
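To make the baseline concrete, here is a minimal sketch of that quadratic approach; the Person fields (birth, death, children) and the use of plain integer dates are illustrative choices, not part of the problem statement:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Person:
    name: str
    birth: int                      # e.g. year of birth
    death: int                      # e.g. year of death
    children: List["Person"] = field(default_factory=list)

def label(person: Person) -> int:
    """Deepest generation of descendants born before `person` died (0 if none)."""
    best = 0
    stack = [(child, 1) for child in person.children]
    while stack:
        node, depth = stack.pop()
        if node.birth <= person.death:   # this descendant was born in person's lifetime
            best = max(best, depth)
            stack.extend((c, depth + 1) for c in node.children)
        # otherwise prune: a child is born after its parent, so the whole subtree is too late
    return best

def label_all(people: List[Person]) -> Dict[str, int]:
    # One DFS per person: O(n^2) in the worst case (e.g. one long chain of descendants).
    return {p.name: label(p) for p in people}
```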
Comments (9)
The following is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the performance can be improved to O(n log(max level label)) with extra work.
One parent case:
For each node x, in reverse topological order, create a binary search tree T_x that is strictly increasing both in date of birth and in number of generations removed from x. (T_x contains the first born child c1 in the subgraph of the ancestry graph rooted at x, along with the next earliest born child c2 in this subgraph such that c2's 'great grandparent level' is strictly greater than that of c1, along with the next earliest born child c3 in this subgraph such that c3's level is strictly greater than that of c2, etc.) To create T_x, we merge the previously-constructed trees T_w where w is a child of x (they are previously-constructed because we are iterating in reverse topological order).
If we are careful with how we perform the merges, we can show that the total cost of such merges is O(n log n) for the entire ancestry graph. The key idea is to note that after each merge, at most one node of each level survives in the merged tree. We associate with each tree T_w a potential of h(w) log n, where h(w) is equal to the length of the longest path from w to a leaf.
When we merge the child trees T_w to create T_x, we 'destroy' all of the trees T_w, releasing all of the potential that they store for use in building the tree T_x; and we create a new tree T_x with (log n)(h(x)) potential. Thus, our goal is to spend at most O((log n)(sum_w(h(w)) - h(x) + constant)) time to create T_x from the trees T_w so that the amortized cost of the merge will be only O(log n). This can be achieved by choosing the tree T_w such that h(w) is maximal as a starting point for T_x and then modifying T_w to create T_x. After such a choice is made for T_x, we merge each of the other trees, one by one, into T_x with an algorithm that is similar to the standard algorithm for merging two binary search trees.
Essentially, the merging is accomplished by iterating over each node y in T_w, searching for y's predecessor z in T_x by birth date, and then inserting y into T_x if y is more levels removed from x than z; then, if y was inserted into T_x, we search for the node in T_x of the lowest level that is strictly greater than y's level, and splice out the intervening nodes to maintain the invariant that T_x is ordered strictly both by birth date and level. This costs O(log n) for each node in T_w, and there are at most O(h(w)) nodes in T_w, so the total cost of merging all trees is O((log n)(sum_w h(w))), summing over all children w except for the child w' such that h(w') is maximal.
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can figure out the actual level of x once we've constructed T_x. (As a technical detail, we actually store the difference of each node's level with that of its parent in T_x so that we can quickly increment the values for all nodes in the tree. This is a standard BST trick.)
That's it. We simply note that the initial potential is 0 and the final potential is positive so the sum of the amortized bounds is an upper bound on the total cost of all merges across the entire tree. We find the label of each node x once we create the BST T_x by binary searching for the latest element in T_x that was born before x died at cost O(log n).
To improve the bound to O(n log(max level label)), you can lazily merge the trees, only merging the first few elements of the tree as necessary to provide the solution for the current node. If you use a BST that exploits locality of reference, such as a splay tree, then you can achieve the above bound.
Hopefully, the above algorithm and analysis are at least clear enough to follow. Just comment if you need any clarification.
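A minimal sketch of the invariant behind T_x may help, though it deliberately sidesteps the balanced-BST machinery and the amortized merging that give the O(n log n) bound: here each node's "frontier" is a plain sorted list, the lists are merged naively (so this version is only O(n·depth)), and the field and function names are mine rather than the answer's.

```python
import bisect
from typing import List, Tuple

# frontier = list of (birth_date, level) pairs, strictly increasing in BOTH coordinates:
# entry i is the earliest-born descendant whose generation level exceeds that of entry i-1.

def build_frontier(child_births: List[int],
                   child_frontiers: List[List[Tuple[int, int]]]) -> List[Tuple[int, int]]:
    # Candidates: each child itself (level 1) plus each child's frontier pushed one level down.
    candidates = [(b, 1) for b in child_births]
    for frontier in child_frontiers:
        candidates.extend((birth, level + 1) for birth, level in frontier)
    candidates.sort()
    merged: List[Tuple[int, int]] = []
    for birth, level in candidates:
        if not merged or level > merged[-1][1]:   # keep only strictly increasing levels
            merged.append((birth, level))
    return merged

def label_from_frontier(frontier: List[Tuple[int, int]], death: int) -> int:
    # Deepest level whose earliest birth is no later than the death date.
    idx = bisect.bisect_right([birth for birth, _ in frontier], death)
    return frontier[idx - 1][1] if idx else 0
```

Processing people in reverse topological order (children before parents) and feeding each child's birth date and frontier into build_frontier for its parent reproduces the structure that the answer queries with a binary search.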
I have a hunch that obtaining for each person a mapping (generation -> date the first descendant in that generation is born) would help.
Since the dates must be strictly increasing, we would be able to use binary search (or a neat data structure) to find the most distant living descendant in O(log n) time.
The problem is that merging these lists (at least naively) is O(number of generations) so this could get to be O(n^2) in the worst case (consider A and B are parents of C and D, who are parents of E and F...).
I still have to work out how the best case works and try to identify the worst cases better (and see if there is a workaround for them).
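For what it's worth, here is a sketch of that naive merge, assuming the mapping is stored as a plain list where entry g holds the earliest birth date among descendants g+1 generations down (my representation, not the commenter's):

```python
from typing import List

def merge_generation_maps(a: List[int], b: List[int]) -> List[int]:
    # Pointwise minimum of two "generation -> earliest descendant birth date" lists.
    # Cost is O(number of generations), which is why a long chain of generations
    # pushes the whole algorithm toward O(n^2).
    merged = []
    for g in range(max(len(a), len(b))):
        dates = [v[g] for v in (a, b) if g < len(v)]
        merged.append(min(dates))
    return merged
```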
We recently implemented a relationship module in one of our projects in which we had everything in a database, and I think the algorithm was at best 2nO(m) (m is the max branch factor). I multiplied the operations by N twice because in the first round we create the relationship graph and in the second round we visit every Person. We stored a bidirectional relationship between every two nodes. While navigating, we only travel in one direction, but we have two sets of operations: one traverses only children, the other traverses only parents.
This kind of looks like a bidirectional graph. But in this case, first you build a list of all Persons, and then you can build the relations and set up FromRelations and ToRelations between each node. Then all you have to do is, for every Person, navigate only the ToRelations of type (Son, Daughter). And since you have the dates, you can calculate everything.
I don't have time to check the correctness of the code, but this will give you an idea of how to do it.
Here's my stab:
There's a relatively straightforward O(n log n) algorithm that sweeps the events chronologically with the help of a suitable top tree.
You really shouldn't assign homework that you can't solve yourself.
Update: This is not the best solution I have come up with, but I've left it because there are so many comments relating to it.
You have a set of events (birth/death), parental state (no descendants, parent, grandparent, etc) and life state (alive, dead).
I would store my data in structures with the following fields:
Sort your events by date, and then for each event take one of the following two courses of logic:
The worst case is O(n*n) if everyone has a lot of living ancestors. However, in general you've got the sorting preprocessing step, which is O(n log(n)), and then you're O(n * avg number of living ancestors), which means that the total time tends to be O(n log(n)) in most populations. (I hadn't counted the sorting prestep properly; thanks to @Alexey Kukanov for the correction.)
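The record fields and the per-event logic are not spelled out in this excerpt, so the following is only one hedged way to realize a sweep with the stated cost profile: sort births and deaths, and at each birth bump the label of every currently living ancestor (the living_ancestors list is my own bookkeeping device, not the answer's):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Person:
    name: str
    birth: int
    death: int
    parents: List["Person"] = field(default_factory=list)
    alive: bool = False
    label: int = 0                                        # 1 = parent, 2 = grandparent, ...
    living_ancestors: List[Tuple["Person", int]] = field(default_factory=list)

def sweep(people: List[Person]) -> None:
    events = sorted(
        [(p.birth, 0, p) for p in people] + [(p.death, 1, p) for p in people],
        key=lambda e: (e[0], e[1]),                       # births before deaths on the same date
    )
    for _, kind, p in events:
        if kind == 1:                                     # death: just mark it
            p.alive = False
            continue
        p.alive = True                                    # birth
        ancestors: List[Tuple[Person, int]] = []
        for parent in p.parents:
            if parent.alive:
                ancestors.append((parent, 1))
            # Grandparents and above, snapshotted at the parent's birth, filtered to those still alive.
            ancestors.extend((a, k + 1) for a, k in parent.living_ancestors if a.alive)
        for a, k in ancestors:                            # everyone here is alive right now,
            a.label = max(a.label, k)                     # so they really are a level-k ancestor
        p.living_ancestors = ancestors
```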
I thought of this this morning, then found that @Alexey Kukanov had similar thoughts. But mine is more fleshed out and has some more optimization, so I'll post it anyways.
This algorithm is O(n * (1 + generations)), and will work for any dataset. For realistic data this is O(n). It takes O(n) work to initialize this data, and O(1) to calculate the case where you had 0 generations. Then for each recursive call to a child you need to do O(generations) work to merge that child's data into yours. Each person gets called when you encounter them in the data structure, and can be called once from each parent, for O(n) calls and total expense O(n * (generations + 1)); this stays O(n * (generations + 1)) if implemented with a linear scan. The sum total of all of these operations is O(n * (generations + 1)). For realistic data sets, this will be O(n) with a fairly small constant.
My suggestion:
- Give each personal record a growable descendant_birthday vector, and store the records in a hash table keyed by name; building this is O(N).
- When a person is processed, set their birth date into descendant_birthday[0] for both parents (grow that vector if necessary). If this field is already set, change it only if the new date is earlier.
- For each of the descendant_birthday[i] dates available in the vector of the current record, follow the same rule as above to update descendant_birthday[i+1] in parents' records.
- Propagating these dates up the graph is O(C*N), with C being the biggest value of "family depth" for the given input (i.e. the size of the longest descendant_birthday vector). For realistic data it can be capped by some reasonable constant without correctness loss (as others already pointed out), and so does not depend on N.
- Finally, label each person with the biggest i for which descendant_birthday[i] is still earlier than the death date; also O(C*N).

Thus for realistic data the solution for the problem can be found in linear time. Though for contrived data like suggested in @btilly's comment, C can be big, and even of the order of N in degenerate cases. It can be resolved either by putting a cap on the vector size or by extending the algorithm with step 2 of @btilly's solution.

A hash table is a key part of the solution in case the parent-child relations in the input data are provided through names (as written in the problem statement). Without hashes, it would require O(N log N) to build a relation graph. Most other suggested solutions seem to assume that the relationship graph already exists.
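A sketch of this bottom-up propagation, under my own naming assumptions (people keyed by name as the hash table suggests, descendant_birthday[g] holding the earliest birth date among descendants g+1 generations down, and a child counter so that leaves are processed first):

```python
from collections import deque
from typing import Dict, List

class Person:
    def __init__(self, name: str, birth: int, death: int, parents: List[str]):
        self.name, self.birth, self.death, self.parents = name, birth, death, parents
        self.children_left = 0                           # children not yet processed
        self.descendant_birthday: List[int] = []

def update(vec: List[int], g: int, date: int) -> None:
    # Keep only the earliest birth date seen for generation g; grow the vector if necessary.
    if g == len(vec):
        vec.append(date)
    elif date < vec[g]:
        vec[g] = date

def label_all(people: Dict[str, Person]) -> Dict[str, int]:
    for p in people.values():
        for parent in p.parents:
            people[parent].children_left += 1

    # Leaves first; a parent becomes ready once all of its children have been processed.
    ready = deque(p for p in people.values() if p.children_left == 0)
    while ready:
        p = ready.popleft()
        for parent_name in p.parents:
            parent = people[parent_name]
            update(parent.descendant_birthday, 0, p.birth)       # p is a child of parent
            for g, date in enumerate(p.descendant_birthday):     # shift p's vector up a level
                update(parent.descendant_birthday, g + 1, date)
            parent.children_left -= 1
            if parent.children_left == 0:
                ready.append(parent)

    # A person's label is the deepest generation whose earliest member was born before they died.
    labels = {p.name: 0 for p in people.values()}
    for p in people.values():
        for g, date in enumerate(p.descendant_birthday):
            if date <= p.death:
                labels[p.name] = max(labels[p.name], g + 1)
    return labels
```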
Create a list of people, sorted by birth_date. Create another list of people, sorted by death_date. You can travel logically through time, popping people from these lists, in order to get a list of the events as they happened.

For each Person, define an is_alive field. This'll be FALSE for everyone at first. As people are born and die, update this record accordingly.

Define another field for each person, called has_a_living_ancestor, initialized to FALSE for everyone at first. At birth, x.has_a_living_ancestor will be set to x.mother.is_alive || x.mother.has_a_living_ancestor || x.father.is_alive || x.father.has_a_living_ancestor. So, for most people (but not everyone), this will be set to TRUE at birth.

The challenge is to identify occasions when has_a_living_ancestor can be set to FALSE. Each time a person is born, we do a DFS up through the ancestors, but only those ancestors for which ancestor.has_a_living_ancestor || ancestor.is_alive is true.

During that DFS, if we find an ancestor that has no living ancestors, and is now dead, then we can set has_a_living_ancestor to FALSE. This does mean, I think, that sometimes has_a_living_ancestor will be out of date, but it will hopefully be caught quickly.