Family tree algorithm
I'm working on putting together a problem set for an intro-level CS course and came up with a question that, on the surface, seems very simple:
You are given a list of people with the names of their parents, their birth dates, and their death dates. You are interested in finding out who, at some point in their lifetime, was a parent, a grandparent, a great-grandparent, etc. Devise an algorithm to label each person with this information as an integer (0 means the person never had a child, 1 means that the person was a parent, 2 means that the person was a grandparent, etc.)
For simplicity, you can assume that the family graph is a DAG whose undirected version is a tree.
The interesting challenge here is that you can't just look at the shape of the tree to determine this information. For example, I have 8 great-great-grandparents, but since none of them were alive when I was born, in their lifetimes none of them were great-great-grandparents.
The best algorithm I can come up with for this problem runs in time O(n²), where n is the number of people. The idea is simple: start a DFS from each person, finding the furthest descendant down in the family tree that was born before that person's death date. However, I'm pretty sure that this is not the optimal solution to the problem. For example, if the graph is just two parents and their n children, then the problem can be solved trivially in O(n). What I'm hoping for is some algorithm that either beats O(n²) or whose runtime is parameterized over the shape of the graph, making it fast for wide graphs and degrading gracefully to O(n²) in the worst case.
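To make the baseline concrete, here is a minimal sketch of that quadratic approach; the Person fields (birth, death, children) and the use of plain integer dates are illustrative choices, not part of the problem statement:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Person:
    name: str
    birth: int                      # e.g. year of birth
    death: int                      # e.g. year of death
    children: List["Person"] = field(default_factory=list)

def label(person: Person) -> int:
    """Deepest generation of descendants born before `person` died (0 if none)."""
    best = 0
    stack = [(child, 1) for child in person.children]
    while stack:
        node, depth = stack.pop()
        if node.birth <= person.death:   # this descendant was born in person's lifetime
            best = max(best, depth)
            stack.extend((c, depth + 1) for c in node.children)
        # otherwise prune: a child is born after its parent, so the whole subtree is too late
    return best

def label_all(people: List[Person]) -> Dict[str, int]:
    # One DFS per person: O(n^2) in the worst case (e.g. one long chain of descendants).
    return {p.name: label(p) for p in people}
```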
Comments (9)
The following is an O(n log n) algorithm that works for graphs in which each child has at most one parent (EDIT: this algorithm does not extend to the two-parent case with O(n log n) performance). It is worth noting that I believe the performance can be improved to O(n log(max level label)) with extra work.
One parent case:
For each node x, in reverse topological order, create a binary search tree T_x that is strictly increasing both in date of birth and in number of generations removed from x. (T_x contains the first born child c1 in the subgraph of the ancestry graph rooted at x, along with the next earliest born child c2 in this subgraph such that c2's 'great grandparent level' is strictly greater than that of c1, along with the next earliest born child c3 in this subgraph such that c3's level is strictly greater than that of c2, etc.) To create T_x, we merge the previously-constructed trees T_w where w is a child of x (they are previously-constructed because we are iterating in reverse topological order).
If we are careful with how we perform the merges, we can show that the total cost of such merges is O(n log n) for the entire ancestry graph. The key idea is to note that after each merge, at most one node of each level survives in the merged tree. We associate with each tree T_w a potential of h(w) log n, where h(w) is equal to the length of the longest path from w to a leaf.
When we merge the child trees T_w to create T_x, we 'destroy' all of the trees T_w, releasing all of the potential that they store for use in building the tree T_x; and we create a new tree T_x with (log n)(h(x)) potential. Thus, our goal is to spend at most O((log n)(sum_w(h(w)) - h(x) + constant)) time to create T_x from the trees T_w so that the amortized cost of the merge will be only O(log n). This can be achieved by choosing the tree T_w such that h(w) is maximal as a starting point for T_x and then modifying T_w to create T_x. After such a choice is made for T_x, we merge each of the other trees, one by one, into T_x with an algorithm that is similar to the standard algorithm for merging two binary search trees.
Essentially, the merging is accomplished by iterating over each node y in T_w, searching for y's predecessor z in T_x by birth date, and then inserting y into T_x if y is more levels removed from x than z; then, if y was inserted into T_x, we search for the node in T_x of the lowest level that is strictly greater than y's level, and splice out the intervening nodes to maintain the invariant that T_x is ordered strictly both by birth date and level. This costs O(log n) for each node in T_w, and there are at most O(h(w)) nodes in T_w, so the total cost of merging all trees is O((log n)(sum_w h(w))), summing over all children w except for the child w' such that h(w') is maximal.
We store the level associated with each element of T_x in an auxiliary field of each node in the tree. We need this value so that we can figure out the actual level of x once we've constructed T_x. (As a technical detail, we actually store the difference of each node's level with that of its parent in T_x so that we can quickly increment the values for all nodes in the tree. This is a standard BST trick.)
That's it. We simply note that the initial potential is 0 and the final potential is positive so the sum of the amortized bounds is an upper bound on the total cost of all merges across the entire tree. We find the label of each node x once we create the BST T_x by binary searching for the latest element in T_x that was born before x died at cost O(log n).
To improve the bound to O(n log(max level label)), you can lazily merge the trees, only merging the first few elements of the tree as necessary to provide the solution for the current node. If you use a BST that exploits locality of reference, such as a splay tree, then you can achieve the above bound.
Hopefully, the above algorithm and analysis are at least clear enough to follow. Just comment if you need any clarification.
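A minimal sketch of the invariant behind T_x may help, though it deliberately sidesteps the balanced-BST machinery and the amortized merging that give the O(n log n) bound: here each node's "frontier" is a plain sorted list, the lists are merged naively (so this version is only O(n·depth)), and the field and function names are mine rather than the answer's.

```python
import bisect
from typing import List, Tuple

# frontier = list of (birth_date, level) pairs, strictly increasing in BOTH coordinates:
# entry i is the earliest-born descendant whose generation level exceeds that of entry i-1.

def build_frontier(child_births: List[int],
                   child_frontiers: List[List[Tuple[int, int]]]) -> List[Tuple[int, int]]:
    # Candidates: each child itself (level 1) plus each child's frontier pushed one level down.
    candidates = [(b, 1) for b in child_births]
    for frontier in child_frontiers:
        candidates.extend((birth, level + 1) for birth, level in frontier)
    candidates.sort()
    merged: List[Tuple[int, int]] = []
    for birth, level in candidates:
        if not merged or level > merged[-1][1]:   # keep only strictly increasing levels
            merged.append((birth, level))
    return merged

def label_from_frontier(frontier: List[Tuple[int, int]], death: int) -> int:
    # Deepest level whose earliest birth is no later than the death date.
    idx = bisect.bisect_right([birth for birth, _ in frontier], death)
    return frontier[idx - 1][1] if idx else 0
```

Processing people in reverse topological order (children before parents) and feeding each child's birth date and frontier into build_frontier for its parent reproduces the structure that the answer queries with a binary search.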
I have a hunch that obtaining for each person a mapping (generation -> date the first descendant in that generation is born) would help.
Since the dates must be strictly increasing, we would be able to use binary search (or a neat data structure) to find the most distant living descendant in O(log n) time.
The problem is that merging these lists (at least naively) is O(number of generations) so this could get to be O(n^2) in the worst case (consider A and B are parents of C and D, who are parents of E and F...).
I still have to work out how the best case works and try to identify the worst cases better (and see if there is a workaround for them).
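For what it's worth, here is a sketch of that naive merge, assuming the mapping is stored as a plain list where entry g holds the earliest birth date among descendants g+1 generations down (my representation, not the commenter's):

```python
from typing import List

def merge_generation_maps(a: List[int], b: List[int]) -> List[int]:
    # Pointwise minimum of two "generation -> earliest descendant birth date" lists.
    # Cost is O(number of generations), which is why a long chain of generations
    # pushes the whole algorithm toward O(n^2).
    merged = []
    for g in range(max(len(a), len(b))):
        dates = [v[g] for v in (a, b) if g < len(v)]
        merged.append(min(dates))
    return merged
```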
We recently implemented a relationship module in one of our projects in which we had everything in a database, and I think the algorithm was at best 2nO(m) (m is the max branch factor). I multiplied the operations by N twice because in the first round we create the relationship graph and in the second round we visit every Person. We stored a bidirectional relationship between every two nodes. While navigating, we only travel in one direction, but we have two sets of operations: one traverses only children, the other traverses only parents.
This kind of looks like a bidirectional graph. But in this case, first you build a list of all Persons, and then you can build the relations and set up FromRelations and ToRelations between each node. Then all you have to do is, for every Person, navigate only the ToRelations of type (Son, Daughter). And since you have the dates, you can calculate everything.
I don't have time to check the correctness of the code, but this will give you an idea of how to do it.
Here's my stab:
There's a relatively straightforward O(n log n) algorithm that sweeps the events chronologically with the help of a suitable top tree.
You really shouldn't assign homework that you can't solve yourself.
Update: This is not the best solution I have come up with, but I've left it because there are so many comments relating to it.
You have a set of events (birth/death), parental state (no descendants, parent, grandparent, etc) and life state (alive, dead).
I would store my data in structures with the following fields:
Sort your events by date, and then for each event take one of the following two courses of logic:
The worst case is O(n*n) if everyone has a lot of living ancestors. However, in general you've got the sorting preprocessing step, which is O(n log(n)), and then you're O(n * avg number of living ancestors), which means that the total time tends to be O(n log(n)) in most populations. (I hadn't counted the sorting prestep properly; thanks to @Alexey Kukanov for the correction.)
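The record fields and the per-event logic are not spelled out in this excerpt, so the following is only one hedged way to realize a sweep with the stated cost profile: sort births and deaths, and at each birth bump the label of every currently living ancestor (the living_ancestors list is my own bookkeeping device, not the answer's):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Person:
    name: str
    birth: int
    death: int
    parents: List["Person"] = field(default_factory=list)
    alive: bool = False
    label: int = 0                                        # 1 = parent, 2 = grandparent, ...
    living_ancestors: List[Tuple["Person", int]] = field(default_factory=list)

def sweep(people: List[Person]) -> None:
    events = sorted(
        [(p.birth, 0, p) for p in people] + [(p.death, 1, p) for p in people],
        key=lambda e: (e[0], e[1]),                       # births before deaths on the same date
    )
    for _, kind, p in events:
        if kind == 1:                                     # death: just mark it
            p.alive = False
            continue
        p.alive = True                                    # birth
        ancestors: List[Tuple[Person, int]] = []
        for parent in p.parents:
            if parent.alive:
                ancestors.append((parent, 1))
            # Grandparents and above, snapshotted at the parent's birth, filtered to those still alive.
            ancestors.extend((a, k + 1) for a, k in parent.living_ancestors if a.alive)
        for a, k in ancestors:                            # everyone here is alive right now,
            a.label = max(a.label, k)                     # so they really are a level-k ancestor
        p.living_ancestors = ancestors
```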
I thought of this this morning, then found that @Alexey Kukanov had similar thoughts. But mine is more fleshed out and has some more optimization, so I'll post it anyways.
This algorithm is O(n * (1 + generations)), and will work for any dataset. For realistic data this is O(n). It takes O(n) work to initialize this data, and O(1) to calculate the case where you had 0 generations. Then for each recursive call to a child you need to do O(generations) work to merge that child's data into yours. Each person gets called when you encounter them in the data structure, and can be called once from each parent, for O(n) calls and total expense O(n * (generations + 1)); this stays O(n * (generations + 1)) if implemented with a linear scan. The sum total of all of these operations is O(n * (generations + 1)). For realistic data sets, this will be O(n) with a fairly small constant.
My suggestion:
- Give each personal record a growable descendant_birthday vector, and store the records in a hash table keyed by name; building this is O(N).
- When a person is processed, set their birth date into descendant_birthday[0] for both parents (grow that vector if necessary). If this field is already set, change it only if the new date is earlier.
- For each of the descendant_birthday[i] dates available in the vector of the current record, follow the same rule as above to update descendant_birthday[i+1] in parents' records.
- Propagating these dates up the graph is O(C*N), with C being the biggest value of "family depth" for the given input (i.e. the size of the longest descendant_birthday vector). For realistic data it can be capped by some reasonable constant without correctness loss (as others already pointed out), and so does not depend on N.
- Finally, label each person with the biggest i for which descendant_birthday[i] is still earlier than the death date; also O(C*N).

Thus for realistic data the solution for the problem can be found in linear time. Though for contrived data like suggested in @btilly's comment, C can be big, and even of the order of N in degenerate cases. It can be resolved either by putting a cap on the vector size or by extending the algorithm with step 2 of @btilly's solution.

A hash table is a key part of the solution in case the parent-child relations in the input data are provided through names (as written in the problem statement). Without hashes, it would require O(N log N) to build a relation graph. Most other suggested solutions seem to assume that the relationship graph already exists.
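A sketch of this bottom-up propagation, under my own naming assumptions (people keyed by name as the hash table suggests, descendant_birthday[g] holding the earliest birth date among descendants g+1 generations down, and a child counter so that leaves are processed first):

```python
from collections import deque
from typing import Dict, List

class Person:
    def __init__(self, name: str, birth: int, death: int, parents: List[str]):
        self.name, self.birth, self.death, self.parents = name, birth, death, parents
        self.children_left = 0                           # children not yet processed
        self.descendant_birthday: List[int] = []

def update(vec: List[int], g: int, date: int) -> None:
    # Keep only the earliest birth date seen for generation g; grow the vector if necessary.
    if g == len(vec):
        vec.append(date)
    elif date < vec[g]:
        vec[g] = date

def label_all(people: Dict[str, Person]) -> Dict[str, int]:
    for p in people.values():
        for parent in p.parents:
            people[parent].children_left += 1

    # Leaves first; a parent becomes ready once all of its children have been processed.
    ready = deque(p for p in people.values() if p.children_left == 0)
    while ready:
        p = ready.popleft()
        for parent_name in p.parents:
            parent = people[parent_name]
            update(parent.descendant_birthday, 0, p.birth)       # p is a child of parent
            for g, date in enumerate(p.descendant_birthday):     # shift p's vector up a level
                update(parent.descendant_birthday, g + 1, date)
            parent.children_left -= 1
            if parent.children_left == 0:
                ready.append(parent)

    # A person's label is the deepest generation whose earliest member was born before they died.
    labels = {p.name: 0 for p in people.values()}
    for p in people.values():
        for g, date in enumerate(p.descendant_birthday):
            if date <= p.death:
                labels[p.name] = max(labels[p.name], g + 1)
    return labels
```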
Create a list of people, sorted by birth_date. Create another list of people, sorted by death_date. You can travel logically through time, popping people from these lists, in order to get a list of the events as they happened.

For each Person, define an is_alive field. This'll be FALSE for everyone at first. As people are born and die, update this record accordingly.

Define another field for each person, called has_a_living_ancestor, initialized to FALSE for everyone at first. At birth, x.has_a_living_ancestor will be set to x.mother.is_alive || x.mother.has_a_living_ancestor || x.father.is_alive || x.father.has_a_living_ancestor. So, for most people (but not everyone), this will be set to TRUE at birth.

The challenge is to identify occasions when has_a_living_ancestor can be set to FALSE. Each time a person is born, we do a DFS up through the ancestors, but only those ancestors for which ancestor.has_a_living_ancestor || ancestor.is_alive is true.

During that DFS, if we find an ancestor that has no living ancestors, and is now dead, then we can set has_a_living_ancestor to FALSE. This does mean, I think, that sometimes has_a_living_ancestor will be out of date, but it will hopefully be caught quickly.