如何识别网络中的节点簇
我有一个表描述了几组连接的节点:
node
origin_node REFERENCES node
start_time
end_time
并且我想找出数据集包含多少个集群,例如如果记录是:
A, B, 10:00, 11:00
B, C, 9:00, 9:15
D, E, 10:00, 10:15
B, A, 13:00, 13:30
E, B, 12:00, 13:20
F, G, 9:00, 9:15
...那么我将有 2 个集群 {A,B,C,D,E } 和 {F,G}
(时间几乎无关紧要 - 它只是为了证明 node+origin_node 不一定是唯一/有序的)。
但我在制定一种从几千行中识别簇的算法时有点陷入困境。
我正在使用 MySQL 5.0.22 - 所以没有“CONNECT BY”,并且可以访问 PHP 和 awk - 尽管对我来说理解算法比理解编码解决方案更容易。只要分析数据的时间不超过几个小时,我就会倾向于简单而不是顺序。
顺便说一句:这是一个现实世界的问题 - 不是家庭作业(我很久以前就不再是学生了 - 也许太早了;)
TIA
I have a table describing several sets of connected nodes:
node
origin_node REFERENCES node
start_time
end_time
and I want to find out how many clusters the dataset contains, e.g. if the records were:
A, B, 10:00, 11:00
B, C, 9:00, 9:15
D, E, 10:00, 10:15
B, A, 13:00, 13:30
E, B, 12:00, 13:20
F, G, 9:00, 9:15
...then I'd have 2 clusters {A,B,C,D,E} and {F,G}
(the times are pretty much irrelevant - it's just there to demonstrate that node+origin_node is not necessarily unique / ordered).
But I'm a bit stuck in working out an algorithm which identifies the clusters from a few thousand rows.
I'm working with MySQL 5.0.22 - so no 'CONNECT BY', and have access to PHP and awk - although it'd be easier for me to understand an algorithm rather than a coded solution. And as long as it takes less than a couple of hours to analyse the data, I'd lean to simplicity over order.
BTW: its a real-world problem - not homework (I stopped being a student a long time ago - perhaps too early ;)
TIA
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
尝试这些链接吗?
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/Category:Data_clustering_algorithms
另外,虽然不是 MySQL,但 Microsoft 上也有一些东西站点:
http://msdn.microsoft.com/en-us/library/ms174879 .aspx
根据您的评论进行编辑:
在您的特定情况下,类似于创建闭包表的操作可能会起作用。
使用临时表...
从任意节点开始。将其分配给新集群。
下一个节点。是否存在从当前识别的集群到节点的链接?
如果不是,则将其分配给新的集群。
如果是,则将其分配给该集群。然后,对于每个链接,验证已处理的节点是否位于同一集群中。如果不是,请将它们重新分配到该集群。
Tried these links?
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/Category:Data_clustering_algorithms
Also, though not MySQL, there also is stuff on Microsoft's site:
http://msdn.microsoft.com/en-us/library/ms174879.aspx
Edit, per your comment:
In your particular case, something akin to creating a closure table might work.
Using a temporary table...
Start with an arbitrary node. Assign it to a new cluster.
Next node. Is there a link to a node from a currently identified cluster?
If no, assign it to a new cluster.
If yes, assign it to that cluster. Then, for each link, verify that already processed node is in the same cluster. If not, reassign them to that cluster.
遍历网络并标记访问过的节点(类似于垃圾收集算法)。它相当高效,但需要相当多的代码。
Went with walking the network and flagging visited nodes (similar to garbage collection algorithms). Its reasonably efficient but needed quite a bit of code.