Compressed graph representation?
I'm working on a side project now that involves encoding all of the links between Wikipedia pages. I've scraped this information to disk, but the memory required to encode the structure of this graph is pretty ridiculous: there are millions of nodes and tens of millions of links. While this structure does fit in memory, I'm not sure what I'd do if there were, say, a billion links or a billion pages.

My question is: is there a way of losslessly compressing a graph too large to fit into memory so that it does fit? If not, is there a good lossy algorithm that, for some definition of "structure", doesn't lose too much structure from the original graph?
6 Answers
Graphs like link graphs and social graphs are very well studied, and they usually have statistical properties that enable efficient compressed representations.

One of these properties, for example, is that for outgoing edges the differential (gap) encoding of the adjacency list has a power-law distribution, i.e. there are a lot of very small values and very few big values, so most universal codes work quite well. In particular, the class of zeta codes is provably optimal in this setting, and in the paper the authors compressed the link graph of a small web crawl to about 3 bits per link.

Their code (for Java, Python and C++) is available on their webpage as a graph compression framework, so you should be able to experiment with it without much coding.

The algorithm is kind of old (2005) and there have been developments in the field since, but I don't have pointers to the papers right now; the improvements are not significant anyway, and I don't think there is any available, tested code that implements them.
I was part of a paper a while ago about compressing web graphs so they would fit in memory. We got it down to about 6 bits per link.
Quite generally speaking, if you have N nodes and an average of X outgoing links per node, with X much smaller than N, you're going to need about X·N·ln N bits of information to represent the graph, unless you can find patterns in the link structure (which you can then exploit to bring down the entropy). X·N·ln N is within an order of magnitude of the size of your 32-bit adjacency list.

There are some tricks you could do to bring down the size some more. The links from Giuseppe are worth checking, but only experiment will tell you how well those algorithms apply to Wikipedia.
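As a rough sanity check of that estimate, here is a back-of-the-envelope calculation in Python; the values of N and X are my own guesses for a Wikipedia-sized link graph, not figures from the question.

```python
# Compare the X*N*ln(N) information estimate with a plain 32-bit adjacency list.
import math

N = 6_000_000   # assumed number of pages (nodes)
X = 20          # assumed average number of outgoing links per page

info_bits = X * N * math.log(N)    # the rough entropy estimate from above
naive_bits = X * N * 32            # one 32-bit node id stored per link

print(f"entropy estimate:       {info_bits / 8 / 1e9:.2f} GB")
print(f"32-bit adjacency list:  {naive_bits / 8 / 1e9:.2f} GB")
# The two are within an order of magnitude of each other, as claimed.
```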
What about just writing your nodes, links, and associations to an existing scalable database system (MySQL, SQL Server, Oracle, etc.)? You can create indexes and stored procedures for faster DB-level processing if needed.

If you can't go this route for some reason, you'll need to page data in and out (just like DB systems do!). Compressing the data is a short-term band-aid in many cases: if you can't raise the RAM ceiling for some reason, you're only buying yourself limited time, so I'd recommend against compressing it.
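A rough sketch of the kind of schema this suggests, using Python's sqlite3 as a stand-in for the database systems named above; the table and column names are invented for illustration.

```python
# Store pages and links in a relational database and index both directions,
# so "links out of a page" and "links into a page" are cheap lookups.
import sqlite3

conn = sqlite3.connect("wikilinks.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS pages (
        id    INTEGER PRIMARY KEY,
        title TEXT UNIQUE
    );
    CREATE TABLE IF NOT EXISTS links (
        src INTEGER REFERENCES pages(id),
        dst INTEGER REFERENCES pages(id)
    );
    CREATE INDEX IF NOT EXISTS idx_links_src ON links(src);
    CREATE INDEX IF NOT EXISTS idx_links_dst ON links(dst);
""")
conn.commit()
```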
If you do not need mutability, take a look at how the Boost Graph Library (BGL) represents a graph in compressed sparse row (CSR) format. According to the docs it "minimizes memory use to O(n+m) where n and m are the number of vertices and edges, respectively". BGL even has an example that mirrors your use case.

Before you go too far with this, you should really figure out how you intend to interrogate your graph. Do you need links pointing to a page as well as links out of a page? Do you need to be able to efficiently find the number of links on a given page? For a pretty well-thought-out list of basic graph operations, take a look at the BGL concepts. You can then map these to the requirements of different algorithms; Dijkstra's shortest path, for example, requires a graph that models "Vertex List Graph" and "Incidence Graph".
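For a feel of what the CSR layout looks like, here is a small Python sketch of the same idea (two flat arrays, O(n + m) integers overall); it is not BGL's compressed_sparse_row_graph itself, and the helper names are made up.

```python
# Compressed sparse row (CSR): an offsets array of length n+1 plus a flat
# targets array of length m. Out-neighbors of v are a contiguous slice.

def build_csr(num_nodes: int, edges: list[tuple[int, int]]):
    edges = sorted(edges)                  # group edges by source node
    targets = [dst for _, dst in edges]
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1              # count each node's out-degree
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]       # prefix sums -> row start offsets
    return offsets, targets

def out_neighbors(offsets, targets, v: int):
    return targets[offsets[v]:offsets[v + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (2, 3), (3, 0)])
print(out_neighbors(offsets, targets, 0))  # [1, 2]
```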
In your case you are trying to compress a SINGLE graph into memory, not a general, large family of graphs. When you have only a single graph to compress, you can pick any arbitrary algorithmic representation for it, and this becomes a question of Kolmogorov complexity. In general, you can't compress random graphs efficiently because they are random and thus can't be predicted, and what can't be predicted can't be compressed. This follows from basic information theory; it's the same reason you can't compress images of random noise.

Suppose you have 2^30 (about a billion) pages, each with exactly 2^4 = 16 outbound links, and the links are truly randomly distributed. The links on every page then represent almost 16 × 30 bits of information (not quite, because the 16 links are all distinct, which adds a minuscule amount of redundancy). So you have about 2^30 × 16 × 30 = 2^32 × 120 bits ≈ 60 GB of information there, and information theory says you can't find a smaller GENERAL representation. You need to use the particular structure of the Wikipedia graph to get below that information-theoretic lower bound.
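Working through that arithmetic with the same assumed numbers (2^30 pages, 16 random outbound links each):

```python
# Information-theoretic size of a random graph with 2^30 pages and
# 16 outbound links per page, each link naming a uniformly random page.
import math

pages = 2 ** 30                   # ~1 billion pages (assumed)
links_per_page = 2 ** 4           # 16 outbound links per page (assumed)
bits_per_link = math.log2(pages)  # ~30 bits to name a random target page

total_bits = pages * links_per_page * bits_per_link
print(f"{total_bits / 8 / 2**30:.0f} GiB of irreducible information")  # ~60 GiB
```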