大型 DAG 上的拓扑排序示例
我正在寻找现实世界的应用程序,其中对大图尺寸执行拓扑排序。
我想象您可以找到此类实例的一些领域是生物信息学、依赖性解析、数据库、硬件设计、数据仓库......但我希望你们中的一些人可能遇到或听说过任何需要的特定算法/项目/应用程序/数据集顶排序。
即使数据/项目可能无法公开访问,任何提示(以及对潜在图形大小的数量级的估计)也可能会有所帮助。
I am looking for real world applications where topological sorting is performed on large graph sizes.
Some fields where I image you could find such instances would be bioinformatics, dependency resolution, databases, hardware design, data warehousing... but I hope some of you may have encountered or heard of any specific algorithms/projects/applications/datasets that require topsort.
Even if the data/project may not be publicly accessible any hints (and estimates on the order of magnitude of potential graph sizes) might be helpful.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
以下是我迄今为止看到的一些拓扑排序示例:
在分布式系统中调度任务图时,通常是
需要对任务进行拓扑排序,然后将它们分配给
资源。我知道任务图包含超过 100,000 个
任务按拓扑顺序排序。请参阅此上下文中的此。
曾几何时,我正在开发文档管理系统。每个
该系统上的文档具有某种优先级约束
其他文档集,例如其内容类型或字段引用。
然后,系统应该能够生成订单文件
并保留拓扑顺序。据我记得,有
两年前大约有 5,000,000 个可用文档!!!
在社交网络领域,有一个著名的查询:
网络中最大的友谊距离。这个问题需要
通过 BFS 方法遍历图,等于
拓扑排序。考虑 Facebook 的成员并找到您的
如果您需要更多真实例子,请随时询问我。我参与过许多涉及大型图表的项目。
PS 对于大型 DAG 数据集,您可以查看 Stanford Large Network Dataset Collection 和 Graphics@Illinois 页面。
Here are some examples I've seen so far for Topological Sorting:
While scheduling task graphs in a distributed system, it is usually
needed to sort the tasks topologically and then assign them to
resources. I am aware of task graphs containing more than 100,000
tasks to be sorted in a topological order. See this in this context.
Once upon a time I was working on a Document Management System. Each
document on this system has some kind of precedence constraint to a
set of other documents, e.g. its content type or field referencing.
Then, the system should be able to generate an order of the documents
with the preserved topological order. As I can remember, there were
around 5,000,000 documents available two years ago !!!
In the field of social networking, there is famous query to know the
largest friendship distance in the network. This problem needs to
traverse the graph by a BFS approach, equal to the cost of a
topological sorting. Consider the members of Facebook and find your
answer.
If you need more real examples, do not hesitate to ask me. I have worked in lots of projects working on on large graphs.
P.S. for large DAG datasets, you may take a look at Stanford Large Network Dataset Collection and Graphics@ Illinois page.
我不确定这是否适合您正在寻找的内容,但您知道 Bio4j 项目吗?
并非所有存储在基于图的数据库中的内容都足以进行拓扑排序(图的重要部分存在有向循环),但是存在像基因本体和分类学这样的子图,这种排序可能是有意义的。
I'm not sure if this fits what you're looking for but did you know Bio4j project?
Not all the contents stored in the graph based DB would be adequate for topological sorting (there exists directed cycles in an important part of the graph), however there are sub-graphs like Gene Ontology and Taxonomy where this ordering may have sense.
TopoR 是一款商业拓扑 PCB 路由器,它首先将 PCB 布线为拓扑问题,然后将拓扑转换为物理拓扑空间。它们支持多达 32 个电气层,因此应该能够支持数千个连接(例如 10^4)。
我怀疑集成电路可能会使用类似的方法。
TopoR is a commercial topological PCB router that works first by routing the PCB as topological problem and then translating the topology into physical space. They support up to 32 electrical layers, so it should be capable of many thousands of connections (say 10^4).
I suspect integrated circuits may use similar methods.
我工作的公司管理着一个(专有)软件漏洞和补丁数据库。补丁通常由软件供应商(如 Microsoft、Adobe 等)定期发布,并且“新的和改进的”补丁“取代”旧补丁,从某种意义上说,如果您将较新的补丁应用到主机,则旧的补丁将被应用到主机上。不再需要补丁。
这就产生了一个 DAG,其中每个软件补丁都是一个节点,其中的弧指向每个“取代”补丁的节点。目前图中有接近 10K 个节点,并且每周都会添加新的补丁。
拓扑排序在这种情况下非常有用,可以验证图表是否包含循环 - 如果确实出现循环,则意味着添加新数据库记录时出现错误,或者数据库实例之间的数据复制失败而导致损坏。
The company where I work manages a (proprietary) database of software vulnerabilities and patches. Patches are typically issued by a software vendor (like Microsoft, Adobe, etc.) at regular intervals, and "new and improved" patches "supercede" older ones, in the sense that if you apply the newer patch to a host then the old patch is no longer needed.
This gives rise to a DAG where each software patch is a node with arcs pointing to a node for each "superceding" patch. There are currently close to 10K nodes in the graph, and new patches are added every week.
Topological sorting is useful in this context to verify that the graph contains no cycles - if they do arise then it means that there was either an error in the addition of a new DB record, or corruption was introduced by botched data replication between DB instances.