在巨大的稀疏矩阵中找到所有循环

发布于 2024-09-03 15:45:38 字数 868 浏览 8 评论 0原文

首先，我是一个Java初学者，所以我不确定这是否可能！基本上我有一个巨大的（3+00万）关系数据数据源（即A是B+C+D的朋友，B是D+G+Z的朋友（但不是A - 即不相互）等），我想要找到这个（不一定是连通的）有向图中的每个循环。

我找到了线程查找图中的所有周期，其中有向我指出了唐纳德·约翰逊（Donald Johnson）的（基本）周期查找算法，至少从表面上看，它看起来会做我想做的事情（我周二回来工作时会尝试 - 我认为它不会）同时问一下也没什么坏处！）。

我快速浏览了约翰逊算法的 Java 实现的代码（在该线程中），看起来关系矩阵是第一步，所以我想我的问题是：

a）Java 是否能够处理 3+百万*3+百万矩阵？（计划用二进制稀疏矩阵表示 A 朋友与 B ）

b）我是否需要找到每个连接的子图作为我的第一个问题，或者循环查找算法会处理不相交的数据吗？

c) 这实际上是问题的适当解决方案吗？我对“基本”循环的理解是，在下图中，它不会选择 ABCDEF，而是会选择 ABF、BCD 等。但这并不是任务的世界末日。

    E
   / \
  D---F
 / \ / \
C---B---A

d) 如果有必要，我可以通过加强关系中的相互性来简化问题 - 即 A-friends-with-B <==> B-friends-with-A，如果真的有必要，我也许可以减少数据大小，但实际上它总是在 100 万左右。

z) 这是 P 任务还是 NP 任务？！我是否贪多嚼不烂？

谢谢大家，任何帮助表示赞赏！安迪

原文

First of all I'm quite a Java beginner, so I'm not sure if this is even possible! Basically I have a huge (3+million) data source of relational data (i.e. A is friends with B+C+D, B is friends with D+G+Z (but not A - i.e. unmutual) etc.) and I want to find every cycle within this (not necessarily connected) directed graph.

I've found the thread Finding all cycles in graph, which has pointed me to Donald Johnson's (elementary) cycle-finding algorithm which, superficially at least, looks like it'll do what I'm after (I'm going to try when I'm back at work on Tuesday - thought it wouldn't hurt to ask in the meanwhile!).

I had a quick scan through the code of the Java implementation of Johnson's algorithm (in that thread) and it looks like a matrix of relations is the first step, so I guess my questions are:

a) Is Java capable of handling a 3+million*3+million matrix? (was planning on representing A-friends-with-B by a binary sparse matrix)

b) Do I need to find every connected subgraph as my first problem, or will cycle-finding algorithms handle disjoint data?

c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.

    E
   / \
  D---F
 / \ / \
C---B---A

d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.

z) Is this a P or NP task?! Am I biting off more than I can chew?

Thanks all, any help appreciated!
Andy

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

高速公鹿 2024-09-10 15:45:38

您正在做的事情类似于数据挖掘中一个经过深入研究的问题，称为关联规则挖掘或更普遍的频繁项集挖掘。通过频繁项集挖掘，您可以发现比您正在做的事情更具体一些，但也更有用。

我们将进行封闭频繁项集挖掘，这将找到所有朋友组，其中每个人都是彼此的朋友。

我现在要说的是，Java 不能做你想做的事。它无法加载那么多内存，并且效率不够高，无法在合理的时间内处理这些数据，您将需要使用 C/C++。我建议使用 LCM，它是一个封闭的频繁项集挖掘器，但由于您拥有的数据量，您还需要将支持设置得相当高。

您可能需要考虑的另一件事是阅读大型图挖掘，这也是一个相当大的研究领域，但 Java 不会削减它。此外，您将无法找到数据中的所有周期（除非数据非常稀疏），它们的数量将会太多。它们也会重叠并且意义不大，您可能会找到几个最大的周期。

回复收藏 0 原文

星 2024-09-10 15:45:38

c) 这实际上是合适的吗？
问题的解决方案？我的
对“基本”循环的理解
是在下图中，而是
而不是选择 ABCDEF 它将
选择 ABF、BCD 等，但那就是
鉴于
任务。
<前><代码> E
/\
D---F
/ \ / \
C---B---A

不会。唐纳德·约翰逊论文中“Elementary”的意思是“简单”，即循环中没有节点出现两次。这意味着算法不会选择 AFBCDBA，而是选择 ABCDEF。

d）如果有必要，我可以简化
通过强制执行相互性来解决问题
关系 - 即 A-与-B 的朋友 <==>
B-与-A的朋友，如果真的的话
有必要我也许可以减少
数据大小，但实际上是这样
总是在 100 万左右
标记。

无向图具有（非简单）循环的向量空间（有很好的基础等），但我不知道它是否会对您有帮助。

z) 这是 P 任务还是 NP 任务？

这是一个枚举问题，其本身既不能属于 P 也不能属于 NP。

c) Is this actually an appropriate
solution for the problem? My
understanding of "elementary" cycles
is that in the graph below, rather
than picking out A-B-C-D-E-F it'll
pick out A-B-F, B-C-D etc. but that's
not the end of the world given the
task.
    E
   / \
  D---F
 / \ / \
C---B---A

No. "Elementary" in the sense of Donald Johnson's paper means simple, that is, no node appears twice in the circle. That means the algorithm won't pick AFBCDBA, but will pick ABCDEF.

d) If necessary, I can simplify the
problem by enforcing mutuality in
relations - i.e. A-friends-with-B <==>
B-friends-with-A, and if really
necessary I can maybe cut down the
data size, but realistically it is
always going to be around the 1mil
mark.

Undirected graphs have vector space of (non-simple) cycles (which has a nice basis, etc.), but I don't know if it'll help you.