MapReduce, Python and NetworkX

Published 2024-08-10 04:37:08


I have implemented an unweighted random walk function for a graph that I built in Python using NetworkX. Below is the snippet of my program that deals with the random walk. Elsewhere in my program, I have a method that creates the graph, and I have a method that simulates various custom graph testing methods that I've written. One of these graph testing methods picks two nodes at random from the graph and runs a random walk between them. The two quantities calculated from this random walk are the hitting time (the number of links traversed from the starting point to the ending point) and the commute time (the number of links traversed from the starting point to the ending point and back again).

import random

def unweighted_random_walk(starting_point, ending_point, graph):
    '''
    starting_point: String that represents the starting point in the graph
    ending_point: String that represents the ending point in the graph
    graph: A NetworkX Graph object
    '''
    # Begin the random walk
    current_point = starting_point
    hitting_time = 0

    # Walk until the ending point is reached, counting traversed links
    while current_point != ending_point:
        # Pick one of the current node's edges with equal probability
        neighbors = list(graph.neighbors(current_point))
        current_point = random.choice(neighbors)
        hitting_time += 1
    return hitting_time

My code for the random walk is pretty straightforward: I just pick random nodes until the ending point is reached. However, the current implementation is very slow when I try running several random walks (I think I'll need to run a million at some point).

My question is: Is there any way that I can use Hadoop MapReduce to parallelize some of the operations that are going on here for this Random Walk? Is there a better way for me to do my random walk?


Comments (3)

烟酉 2024-08-17 04:37:08


I don't see how map-reduce can help you. It's used where you have a two-part operation: the first part is a calculation that can be performed independently on many different data elements, and the second part is somehow combining all those results. Perhaps there's a clever way to use map-reduce to help with this random walk, but I don't see it.

Your random walk is completely random: it could end up with many loops, even hopping back and forth between the same two nodes before continuing on. Perhaps you want to somehow constrain it so you don't have so large a space to search?

小…红帽 2024-08-17 04:37:08


To address your question:

  1. You need to address Ned's comment. He beat me to saying it. Explain your code; more on that later.

  2. I can not fathom a walking algorithm that could be run in parallel. By their very nature, they are each a linear process; each step depends on the previous. You cannot possibly know what next node to jump to without knowing the previous node (with the exception of the starting node). If your code indeed represents a random walk where the choices are all independent of the previous ones, you need to explain that in your question.

  3. Assuming each random walk is independent, however, you can run many random walks simultaneously. We call this scenario embarrassingly parallel, and that's a very lucky thing.

  4. I have no idea why you want to use Hadoop, specifically, here. The first step should be, "Can I just write this as a basic program and use a qsub (or equivalent) script to farm out a bunch of runs of this program to the server?" If the answer is no, the next step is, "Can I use the multiprocessing module?" If you go with multiprocessing, you might want to take a look at Jesse Noller's multiprocessing presentation from PyCon 2009.
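The embarrassingly parallel approach from points 3 and 4 can be sketched with the standard-library multiprocessing module. The graph construction below is a hypothetical stand-in for the asker's own graph-creation method, and the start/end nodes are illustrative:

```python
import random
from multiprocessing import Pool

import networkx as nx


def build_graph():
    # Hypothetical stand-in for the asker's graph-construction method.
    return nx.cycle_graph(10)


def one_walk(seed):
    # Each walk is independent, so a per-task seed keeps runs reproducible.
    rng = random.Random(seed)
    graph = build_graph()
    current, target = 0, 5
    hitting_time = 0
    while current != target:
        current = rng.choice(list(graph.neighbors(current)))
        hitting_time += 1
    return hitting_time


def run_walks(n_walks, processes=2):
    # Farm the independent walks out to a worker pool; each worker
    # simply runs the same single-walk function on its own seed.
    with Pool(processes) as pool:
        return pool.map(one_walk, range(n_walks))
```

Rebuilding the graph inside every task is wasteful for large graphs; a `Pool(initializer=...)` that constructs the graph once per worker process would be the natural next step.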

Now, with regards to your particular code...

  1. You need to explain what the nodes in your graph are. I'm confused why you're treating them like a dictionary (calling .keys() on them). If they are dictionaries, tell us what the keys and values are. I hope you're not storing the neighbors as keys there, because NetworkX already gives you that, via the Graph.neighbors() method. If you're storing the neighbors of the nodes in the nodes themselves, you have a misunderstanding of the NetworkX library. Let the graph do the work for you.

  2. You have the same logic twice in unweighted_random_walk(), once for the trip from the start node to the destination node, then again for the destination node to the start node. Why? All you need is the logic for one direction. Call this function twice. Call it with the start and destination nodes as arguments to get the direction one way, then swap the order of the arguments to be destination then start to get the walk the other direction. You then have two independent calls, and can now run these in parallel.

  3. Don't use while True:—not just here, but in general. You should always indicate the actual condition under which to continue. e.g.,

    while current_point != ending_point:
        ...
    
  4. Don't return a string of the information, return the information directly. e.g.,

    return hitting_time
    

    Note that by following my advice in point 2 directly above, you only have to return the hitting time, and sum the hitting times for the there-call and the back-call to get the total commute time. Convenient, right?
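Points 2 through 4 together might look like the sketch below. The toy graph and node labels are illustrative, not the asker's:

```python
import random

import networkx as nx


def unweighted_random_walk(start, end, graph, rng=random):
    # One direction only; the hitting time is the number of edges traversed.
    current = start
    hitting_time = 0
    while current != end:
        # random.choice over the neighbor list gives each edge equal probability.
        current = rng.choice(list(graph.neighbors(current)))
        hitting_time += 1
    return hitting_time


# Commute time is just the two independent hitting times summed.
graph = nx.path_graph(4)  # toy graph: 0 - 1 - 2 - 3
there = unweighted_random_walk(0, 3, graph)
back = unweighted_random_walk(3, 0, graph)
commute_time = there + back
```

Because the two calls share no state, they are exactly the kind of independent work units that a process pool can run side by side.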


EDIT: Included links to Jesse Noller's presentations and to Disco.

简美 2024-08-17 04:37:08


You don't have to actually perform the random walk if you use the formula detailed in this paper.
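The linked paper isn't reproduced here, but one standard analytic route (first-step analysis) sets up a linear system for the expected hitting times and solves it directly, with no simulation at all: h(target) = 0, and for every other node u, h(u) = 1 + the mean of h(v) over u's neighbors. A sketch with NumPy, assuming the expected hitting time is the quantity wanted:

```python
import numpy as np
import networkx as nx


def expected_hitting_time(graph, source, target):
    # Solve h(u) = 1 + (1/deg(u)) * sum of h(v) over neighbors v,
    # with h(target) = 0 eliminated from the system.
    # (Assumes target is reachable from source; otherwise the system is singular.)
    nodes = [n for n in graph.nodes() if n != target]
    index = {n: i for i, n in enumerate(nodes)}
    A = np.eye(len(nodes))
    b = np.ones(len(nodes))
    for u in nodes:
        deg = graph.degree(u)
        for v in graph.neighbors(u):
            if v != target:
                A[index[u], index[v]] -= 1.0 / deg
    return np.linalg.solve(A, b)[index[source]]
```

For a path graph 0 - 1 - 2, the expected hitting time from 0 to 2 works out to 4 steps, which a million simulated walks would only approximate.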
