如何对无标度图进行采样

发布于 2024-09-29 22:16:47 字数 497 浏览 7 评论 0原文

给定一个大型无标度图(社交网络图),对其进行采样以使样本保留原始属性的可接受的抽象的最佳方法是什么?

我有一个大图(Munmun 的 Twitter 数据集,如果你知道的话)。但我需要该图的一个连接样本,具有相当大的直径(tl;dr...为什么应要求...直径为 10 会很好)。

问题是任何广度优先搜索总是可能遇到一些大规模连接的节点。所以我开始这样的搜索,获取我遇到的所有节点的朋友。我不可避免地会遇到一些大规模连接的节点,并且必须找到它们所有的朋友。这是一个问题,因为我最终得到了图中彼此靠近的大量节点。为了使编程分析可行,我必须限制节点(和边)的数量。这个练习的重点是找到节点之间的最短路径,所以我通常对节点的所有邻居感兴趣。这就是问题所在。

解决这个问题的一种技巧是限制最大值。连接到我感兴趣的用户的节点数量。例如,如果我在广度优先搜索中遇到@barackobama,我确保只接受他的一小部分朋友,而忽略其余的。但是这个被黑的图表值得一去吗,还是我在寻找最短路径方面丢失了太多信息?

希望这是有道理的...

Given a large scale-free graph (a social network graph), what's the best way to sample it such that the sample retains an acceptable abstraction of the properties of the original?

I have a large graph (Munmun's twitter dataset, if you know it). But I need a connected sample of that graph with a reasonably large diameter (tl;dr... reasons why on request... a diameter of 10 would be good).

The problem is that any kinda breadth-first search always is likely to come across some massively connected nodes. So I start such a search, getting the friends of all nodes which I come across. I inevitably come across some massively-connected nodes, and have to get all their friends. This is a problem because I end up with a large number of nodes which are close to each other in the graph. To make programmatic analysis feasible, I have to limit the number of nodes (and edges). The whole point of this exercise is to find shortest paths between nodes, so I'm generally interested in ALL of a node's neighbours. And that's the problem.

One hack around this is to limit the max. number of nodes connected to a user which I'm interested in. For instance, if I come across @barackobama in my breadth-first search, I ensure that I only accept some small proportion of his friends and ignore the rest. But would this hacked graph be worth a damn, or am I losing too much information in terms of finding shortest paths??

Hope that makes sense...

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

任谁 2024-10-06 22:16:47

存在多种采样方法,如何选择一种方法取决于(除其他外)您想要保留的属性。我在论文复杂网络中的采样和推理中找到了文献综述(第3节)[ Maiya '11] 就此而言,信息非常丰富。

但您似乎已经找到了一种对网络进行采样的方法,现在您想了解样本是否能够代表整个图的最短路径。您可以尝试看看这篇论文:复杂网络测量:估计相关性观察到的属性 [Latapy &马尼安'08]。他们描述了一种评估样本代表性的方法,涉及各种经典拓扑特性。总结他们的方法,他们最初可以访问整个研究网络,并随着样本量的增加,对这些数据模拟一些采样过程。他们监控属性如何根据样本大小演变,并在感兴趣的属性足够稳定时决定适当的大小。他们的工具在线免费提供

编辑:我可以在网上找到的唯一现成的工具是 Albatross 。相关文章Albatross Sampling: Robust and effective Hybrid Vertex Sampling for Social Graphs [Jin et al. '11] 还包含对现有采样方法的精彩回顾,其中一些方法在他们提供的源代码中实现。

编辑2:我需要在Linux系统上使用Albatross,所以我做了一个Java移植。它很原始,但看起来效果很好。它可以在 GitHub 上找到:https://github.com/vlabatut/Albatross

Several sampling methods exist, how to choose one depends (amongst other things) of the properties you want to preserve. I found the literature review (section 3) in the thesis Sampling and Inference in Complex Networks [Maiya '11] very informative, for that matter.

But you seem to have found a way of sampling your network, and now you want to find out if the sample is representative of the whole graph in terms of shortest paths. You can try to have a look at this paper: Complex Network Measurements: Estimating the Relevance of Observed Properties [Latapy & Magnien '08]. They describe a method to assess the representativeness of a sample, regarding various classic topological properties. To summarize their approach, they initially have access to the whole studied network, and simulate some sampling process on these data, with increasing sample size. They monitor how properties evolve depending on sample size, and decide of an appropriate size when the properties of interest are stable enough. Their tool is freely available online.

Edit: the only ready-to-use tool I could find online is Albatross. The associated article Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs [Jin et al. '11] also contains a nice review of existing sampling methods, some of which are implemented in the source code they provide.

Edit 2: I needed to use Albatross on a Linux system, so I did a Java port. It's very raw, but it seems to work fine. It's available on GitHub: https://github.com/vlabatut/Albatross

青衫负雪 2024-10-06 22:16:47

我不确定我是否正确理解你的问题。我认为您面临的主要问题是,如何计算一个巨大的有向图中两个节点的最短路径。创建图表的子样本似乎是您创建有效解决方案的尝试。 (但我可能完全误解了你。)

也许这个问题对你有一些指导: 有效地找到大图中的最短路径

不过,该问题中的图似乎要小得多。

I am not sure, if I understand your question correctly. I think the main question you have is, about how you can compute the shortest path of two nodes in a giant, directed graph. Creating a subsample of the graph seems to be your attempt to create an efficient solution. (But I probably misunderstood you completely.)

Perhaps this SO-Question has some pointers for you: Efficiently finding the shortest path in large graphs

The graphs in that question seem to be significantly smaller, though.

℉服软 2024-10-06 22:16:47

您可能需要检查以下内容:Gscaler:https://github.com/jayCool/Gscaler
这是一个最新的工具,可以生成合成比例图。

它包含 jar 文件和相关论文供您参考。

You might want to check the following: Gscaler: https://github.com/jayCool/Gscaler
This is a recent tool which produces synthetic scaled graphs.

It contains the jar file and the related paper for your reference.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文