在巨大的稀疏矩阵中找到所有循环
首先,我是一个Java初学者,所以我不确定这是否可能!基本上我有一个巨大的(3+00万)关系数据数据源(即A是B+C+D的朋友,B是D+G+Z的朋友(但不是A - 即不相互)等),我想要找到这个(不一定是连通的)有向图中的每个循环。
我找到了线程查找图中的所有周期,其中有向我指出了唐纳德·约翰逊(Donald Johnson)的(基本)周期查找算法,至少从表面上看,它看起来会做我想做的事情(我周二回来工作时会尝试 - 我认为它不会)同时问一下也没什么坏处!)。
我快速浏览了约翰逊算法的 Java 实现的代码(在该线程中),看起来关系矩阵是第一步,所以我想我的问题是:
a)Java 是否能够处理 3+百万*3+百万矩阵? (计划用二进制稀疏矩阵表示 A 朋友与 B )
b)我是否需要找到每个连接的子图作为我的第一个问题,或者循环查找算法会处理不相交的数据吗?
c) 这实际上是问题的适当解决方案吗?我对“基本”循环的理解是,在下图中,它不会选择 ABCDEF,而是会选择 ABF、BCD 等。但这并不是任务的世界末日。
E
/ \
D---F
/ \ / \
C---B---A
d) 如果有必要,我可以通过加强关系中的相互性来简化问题 - 即 A-friends-with-B <==> B-friends-with-A,如果真的有必要,我也许可以减少数据大小,但实际上它总是在 100 万左右。
z) 这是 P 任务还是 NP 任务?!我是否贪多嚼不烂?
谢谢大家,任何帮助表示赞赏! 安迪
First of all I'm quite a Java beginner, so I'm not sure if this is even possible! Basically I have a huge (3+million) data source of relational data (i.e. A is friends with B+C+D, B is friends with D+G+Z (but not A - i.e. unmutual) etc.) and I want to find every cycle within this (not necessarily connected) directed graph.
I've found the thread Finding all cycles in graph, which has pointed me to Donald Johnson's (elementary) cycle-finding algorithm which, superficially at least, looks like it'll do what I'm after (I'm going to try when I'm back at work on Tuesday - thought it wouldn't hurt to ask in the meanwhile!).
I had a quick scan through the code of the Java implementation of Johnson's algorithm (in that thread) and it looks like a matrix of relations is the first step, so I guess my questions are:
a) Is Java capable of handling a 3+million*3+million matrix? (was planning on representing A-friends-with-B by a binary sparse matrix)
b) Do I need to find every connected subgraph as my first problem, or will cycle-finding algorithms handle disjoint data?
c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.
E
/ \
D---F
/ \ / \
C---B---A
d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.
z) Is this a P or NP task?! Am I biting off more than I can chew?
Thanks all, any help appreciated!
Andy
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您正在做的事情类似于数据挖掘中一个经过深入研究的问题,称为关联规则挖掘或更普遍的频繁项集挖掘。通过频繁项集挖掘,您可以发现比您正在做的事情更具体一些,但也更有用。
我们将进行封闭频繁项集挖掘,这将找到所有朋友组,其中每个人都是彼此的朋友。
我现在要说的是,Java 不能做你想做的事。它无法加载那么多内存,并且效率不够高,无法在合理的时间内处理这些数据,您将需要使用 C/C++。我建议使用 LCM,它是一个封闭的频繁项集挖掘器,但由于您拥有的数据量,您还需要将支持设置得相当高。
您可能需要考虑的另一件事是阅读大型图挖掘,这也是一个相当大的研究领域,但 Java 不会削减它。此外,您将无法找到数据中的所有周期(除非数据非常稀疏),它们的数量将会太多。它们也会重叠并且意义不大,您可能会找到几个最大的周期。
What you're doing is similar to a very well studied problem in data mining, known as association rule mining or more generally frequent itemset mining. What you can find with frequent itemset mining, is a little bit more specific than what you're doing, but also more useful.
We'll go with closed frequent itemset mining, what this will do is find all groups of friends, where everyone is friends with each other.
I'm going to say this right now, that Java can't do what you want it to do. It can't load that much memory and it's not efficient enough to process that data in any reasonable amount of time, you're going to need to use C/C++. I'd suggest using LCM which is a closed frequent itemset miner, but you're also going to need set the support pretty high, because of the amount of data that you have.
Another thing you might want to consider is reading up on Large Graph Mining, which is also a fairly large area of research, but Java isn't going to cut it. Also you're not going to be able to find all the cycles in your data (unless it is incredibly sparse), there's going to be far too many of them. They're also going to be overlapping and not very meaningful, what you're going to maybe possibly be able to find is several of the largest cycles.
不会。唐纳德·约翰逊论文中“Elementary”的意思是“简单”,即循环中没有节点出现两次。这意味着算法不会选择
AFBCDBA
,而是选择ABCDEF
。无向图具有(非简单)循环的向量空间(有很好的基础等),但我不知道它是否会对您有帮助。
这是一个枚举问题,其本身既不能属于 P 也不能属于 NP。
No. "Elementary" in the sense of Donald Johnson's paper means simple, that is, no node appears twice in the circle. That means the algorithm won't pick
AFBCDBA
, but will pickABCDEF
.Undirected graphs have vector space of (non-simple) cycles (which has a nice basis, etc.), but I don't know if it'll help you.
It is an enumeration problem, which, by itself, cannot be in P nor NP.
找到每个周期听起来并不合理。将会有指数数量的循环,所有循环都相互重叠。
至于P或NP,最慢的部分实际上是枚举每个循环(因为可能有很多个循环)。如果没有循环,则存在线性算法。
也许您实际上想将图划分为双连通分量?
请参阅 http://en.wikipedia.org/wiki/Biconnected_component
而且它几乎从来不合理将图形存储在矩阵中。理论上听起来不错,但在实践中无法扩展,请改用邻接列表(http://en. wikipedia.org/wiki/Adjacency_list)
Finding every cycle does not sound reasonable. There will be exponential number of cycles, all overlapping each other.
As for P or NP, the slowest part is actually enumerating each cycle (because there can be so many of them). If there are no cycles, a linear algorithm exists.
Maybe you actually want to divide the graph in biconnected components?
See http://en.wikipedia.org/wiki/Biconnected_component
Also it is almost NEVER reasonable to store graphs in matrices. It sounds nice in theory, but does not scale in practice, Use adjacency lists instead ( http://en.wikipedia.org/wiki/Adjacency_list )