如何压缩有向图中的重复分支?
我经常使用 Java 程序堆转储产生的有向图。它们的特点之一是它们包含大量重复的模式。我想找到一种压缩这些模式的方法,同时仍然保留图形的基本结构。例如,考虑以下“分子”:
| |
A1 A5
/ \ / \
B2 C4 B6 C8
\ / \ /
D3 D7
字母代表对象的类型,数字代表其唯一的 id(或 dfnum)。显然,第二个分子是第一个分子的重复,只是具有不同的 ID。从堆分析的角度来看,实际的 id 并不重要,因此您可以将 A5 处的分子替换为实际上表示“A1 的另一个副本”的内容。在解压缩时(例如,对于堆分析器的输入),您可以分配任意唯一的 id。
我可以通过在图的 dfs 期间维护对象类型的散列来发现此类模式。因此,A1 的哈希值将包含(例如)“A^B^C^D”,这将与 A5 的哈希值匹配。
我遇到的问题是分子指向某个外部分子。我所说的外部是指之前在 dfs 中访问过的东西。例如(对丑陋的ascii图形感到抱歉):
| |
A1 A5
/ \ / \
B2 C4 B6 C7
\ \ / /
\ \ / /
\ | | /
\ | | /
\| |/
D3
对于这种情况,当我从A5下降时,我发现D3已经被访问过。所以我希望 A5 的哈希码代表 D3 的唯一值,而不仅仅是类型。即类似“A^B^C^D3”的内容。因此,在压缩/解压缩时,我们将 A5 区分为 A1 的副本,而不是 B 和 C 指向其他 D 的其他 A。
我的问题是是否有任何技巧可以进行这样的计算,即如何告诉 D是在我们的分子“外部”吗?其根是A?这也必须以有效的方式完成,最好使用一个或两个 DFS。 (堆转储包含数千万个对象)。您事先并不知道 A 是候选者,因此它可能需要是一种在 dfs 期间执行此操作的算法。也许与支配树有关?
希望这是有道理的!
编辑:更新图表以澄清 A1 和 A5 本身有父节点并且只是在整个图的 DFS 遍历过程中发现的任意节点这一事实。
澄清:就我的目的而言,100% 保证匹配并不重要。通过使用上述哈希码,我很高兴冒着极小的机会发生哈希冲突,并且算法会错误地对分子进行错误分类。由于此类碰撞很少见,因此不太可能对整体情况产生太大影响。
I work a lot with directed graphs resulting from heap dumps of Java programs. One thing that characterizes them is that they contain lots of repeating patterns. I would like to find a way of compressing such patterns whilst still retaining the essential structure of the graph. For instance consider the following "molecules":
| |
A1 A5
/ \ / \
B2 C4 B6 C8
\ / \ /
D3 D7
the letter represents the type of the object and the number represents its unique id (or dfnum). Clearly the second molecule is a repeat of the first just with different ids. From a heap analysis point of view, the actual ids are unimportant so you could replace the molecule at A5 with something that said effectively "another copy of A1". On decompression (for input to heap analysers for instance) you could just assign arbitrary unique ids.
I could spot such patterns by maintaining a hash of the type of the objects during the dfs of the graph. So the hash for A1 would contain (for instance) "A^B^C^D" and this would match that for A5.
The problem I have is with molecules that point to some external molecule. By external I mean something that has been visited earlier in the dfs. For instance (sorry about the ugly ascii graphics):
| |
A1 A5
/ \ / \
B2 C4 B6 C7
\ \ / /
\ \ / /
\ | | /
\ | | /
\| |/
D3
for this situation when I come to descend from A5 I find that D3 has already been visited. So I would like the hash code for A5 to represent the unique value for D3 rather than just the type. I.e. something like "A^B^C^D3". So on compression/decompression we are differentiating A5 as being a copy of A1 rather than some other A whose B and C point to some other D.
My question is whether there are any tricks to do such a calculation, i.e. how to tell that D is "outside" our molecule whose root is A? This also has to be done in an efficient manner, preferably with one or two dfs's. (Heapdumps contain tens of millions of objects). You don't know in advance that A is a candidate so it probably needs to be an algorithm that does it during the dfs. Maybe something dominator tree related?
Hope that makes sense!
Edit: updated diagrams to clarify the fact that A1 and A5 themselves have parents and are just arbitrary nodes discovered during a DFS walk of the entire graph.
Clarification: for my purposes it's not important that the match is 100% guaranteed. By using hash codes as above I am happy to take the very small chance that there will be a hash collision and the algorithm will mistakenly classify a molecule incorrectly. Because such collisons will be rare they are unlikely to much affect the overall picture.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
没有太深入地考虑你的问题,我已经使用 解决了为我的研究压缩网络数据的问题特里。您应该能够将数据序列化到特里树上以进行压缩。
Without thinking about your problem in too much depth, I have solved my problem of compressing web data for my research using a trie. You should be able to serialize your data onto a trie for the purposes of compression.
据我所知,这可能与图同构问题有关。维基百科页面有一些关于当前解决此问题的方法的指示。然而,从简单的浏览来看,其中大多数似乎都是为了考虑两个完整的图而设计的,而不是在更大的图中寻找同构子图。
关于搜索算法,我的直觉是深度优先搜索不是解决这个问题的正确方法。您可能会想一下广度优先遍历会做什么。至少在您描述的具体示例中,这将允许您首先查看 A1 和 A5,然后再确定其中任何一个的特定“分子”形状。
From what I can tell, this is likely to be related to the graph isomorphism problem. The wikipedia page has a few pointers to current approaches to this problem. However, from a brief skimming, most of these seem to be designed to consider two entire graphs, and not look for isomorphic subgraphs within a larger graph.
With respect to search algorithms, my gut feeling is that depth-first search is not the right approach for this problem. You might think for a bit about what a breadth-first traversal might do. At least in the specific example you describe, that would allow you to look at A1 and A5 first, before committing to a specific "molecule" shape at either one.
这可能是一个过于抽象的答案,无法真正提供帮助,但您可以为每个重复模式创建一个子图,然后将每个模式出现折叠到单个节点(带有指向相应子图结构的指针)。该节点将管理连接到该模式的所有边。这些边还必须记住它们连接到的模式的节点,因此您将能够提供隐藏表示细节的图形遍历,但遍历图形就好像它是“按其含义”一样。如果您无法抽象内部表示并且您的算法需要理解嵌套图,这将会很复杂。
附带说明一下,虽然图同构通常是一个棘手的问题,但在您的情况下,您的图上有很多元数据(例如对象类型、字段名称等),这相当于有一个标记图表,具有非常罕见的选择性标签。这些标签大大减少了查找同构模式所需的工作量(当然,最大尺寸要小,否则您的模式缓存将填满所有内存)。
由于将会有很多对象紧密遵循类定义(如果没有继承,对象将是结构体,并且它们的运行时类型将与定义完全匹配),我预测您想要做的事情有很多显着压缩对象图的潜力。
This is probably a too abstract answer to actually help, but you can create a subgraph per repeated pattern, then collapse each pattern occurrence to a single node (with a pointer to the respective subgraph structure). The node would manage all edges connected to the pattern. Such edges must also remember the node of the pattern they connect to, so you would be able to offer graph traversals which hide the details of the representation, but traverse the graph as if it was "as it meant to be". This would be complicated if you fail to abstract the internal representation and your algorithms need to understand nested graphs.
As a side note, while graph isomorphism is generally a tough problem, in your case you have so much metadata on your graph (e.g. object types, field names, etc), which is equivalent to having a labeled graph, with very rare, selective labels. Such labels greatly prune the required effort to find isomorphic patterns (up to a small size of course, otherwise your pattern cache would fill all memory).
Since there are going to be lots of objects that follow closely the class definitions (if there was no inheritance, objects would be structs and their runtime types would exactly match the definitions), I predict that what you're trying to do has lots of potential to significantly compress the object graph.