Good graph traversal algorithm

Abstract problem: I have a graph of about 250,000 nodes, and the average connectivity is around 10. Finding a node's connections is a long process (say, 10 seconds). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the DB. Allowing concurrency, but not having more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest?

Concrete problem: I'm trying to scrape a website's user pages. To discover new users, I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.

My current implementation:

import datetime
import random
import sys
import time

# ThreadPool, models, and alias_view are project-specific modules (not shown).

def run():
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool):
    id = user
    alias = models.Alias.get(id)

    # Skip the import if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now():
        sys.stderr.write("Skipping: %s\n" % user)
    else:
        sys.stderr.write("Importing: %s\n" % user)
        # Throttle: don't let the import queue grow past 20 jobs
        while import_pool.num_jobs() > 20:
            print("Too many queued jobs, sleeping")
            time.sleep(15)

        import_pool.add_job(alias_view.import_id, [id],
                            lambda rv: sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2:
        # Queue two randomly chosen friends for crawling; `friend` (not `user`)
        # avoids clobbering the variable captured by the lambda above
        for friend in random.sample(users, 2):
            if user_pool.num_jobs() < 100:
                user_pool.add_job(do_user, [friend, import_pool, user_pool])

def crawl(id, limit=50):
    '''returns the first 'limit' friends of a user'''
    *not relevant*

Problems with the current implementation:

  • Gets stuck in cliques that I've already imported, wasting time and leaving the import threads idle.
  • Will add more as they get pointed out.

So, marginal improvements are welcome, as well as full rewrites. Thanks!

Comments (4)

゛时过境迁 2024-08-09 06:01:31

To remember the IDs of the users you've already visited, you need a map of 250,000 integers. That's far from "too much". Just maintain such a map and only traverse edges that lead to not-yet-discovered users, adding them to the map as you find such edges.

As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.
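
A minimal sketch of that idea -- assuming a get_friends(user) helper that wraps the slow friend-list fetch (the helper name is made up here, not from the question):

import threading
from collections import deque

def bfs_crawl(seed, get_friends):
    visited = {seed}          # 250,000 IDs fit comfortably in memory
    lock = threading.Lock()   # guards `visited` if several workers share it
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for friend in get_friends(user):   # the slow, ~10-second call
            with lock:
                if friend in visited:
                    continue               # edge leads to a known node: skip it
                visited.add(friend)
            queue.append(friend)           # newly discovered node, crawl it later
    return visited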

伏妖词 2024-08-09 06:01:31

I am really confused as to why it takes 10 seconds to add a node to the DB. That sounds like a problem. What database are you using? Do you have severe platform restrictions?

With modern systems, and their oodles of memory, I would recommend a nice simple cache of some kind. You should be able to create a very quick cache of user information that would allow you to avoid repeated work. When you have encountered a node already, stop processing. This will avoid cycling forever in cliques.

If you need to allow for rehashing existing nodes after a while, you can use a last_visit_number, which would be a global value in the DB. If the node has that number, then this crawl is the one that encountered it. If you want to automatically revisit any nodes, you just need to bump the last_visit_number before starting the crawl.
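
A rough sketch of that generation-number scheme (the field and helper names are hypothetical):

def start_crawl(db):
    # Bump the global counter once, before the crawl begins; every node
    # touched by this crawl gets stamped with the new value.
    db.last_visit_number += 1
    return db.last_visit_number

def should_process(node, crawl_number):
    if node.last_visit_number == crawl_number:
        return False     # this crawl already encountered the node
    node.last_visit_number = crawl_number
    return True          # first encounter in this crawl: process it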

By your description, I am not quite sure how you are getting stuck.

Edit ------
I just noticed you had a concrete question. In order to increase how quickly you pull in new data, I would keep track of the number of times a given user was linked to in your data (imported or not yet imported). When choosing a user to crawl, I would pick users that have a low number of links. I would specifically go for either the lowest number of links or a random choice among the users with the lowest number of links.
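
One possible sketch of that selection policy, using a running counter of inbound links (the names are illustrative, not from the question):

import random
from collections import Counter

link_counts = Counter()   # user -> times seen in friend lists so far

def record_friends(friends):
    link_counts.update(friends)    # call this for every fetched friend list

def pick_next_user(candidates):
    # Prefer the least-linked users: their neighborhoods are the least explored.
    fewest = min(link_counts[u] for u in candidates)
    least_linked = [u for u in candidates if link_counts[u] == fewest]
    return random.choice(least_linked)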

Jacob

爺獨霸怡葒院 2024-08-09 06:01:31

There is no particular algorithm that will help you optimise the construction of a graph from scratch. One way or another, you are going to have to visit each node at least once. Whether you do this depth first or breadth first is irrelevant from a speed perspective. Theran correctly points out in a comment below that breadth-first search, by exploring nearer nodes first, may give you a more useful graph immediately, before the whole graph is completed; this may or may not be a concern for you. He also notes that the neatest version of depth-first search is implemented using recursion, which could potentially be a problem for you. Note that recursion is not required, however; you can add incompletely explored nodes to a stack and process them linearly if you wish.
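
For instance, the recursion-free depth-first variant can be sketched with an explicit stack (get_friends is again an assumed wrapper around the slow fetch):

def dfs_crawl(seed, get_friends):
    visited = set()
    stack = [seed]        # explicit stack instead of recursion
    while stack:
        user = stack.pop()
        if user in visited:
            continue
        visited.add(user)
        stack.extend(get_friends(user))   # unexplored neighbours go on the stack
    return visited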

If you do a simple existence check for new nodes (O(1) if you use a hash for lookup), then cycles will not be a problem at all. Cycles are only a concern if you do not store the complete graph. You can optimise searches through the graph, but the construction step itself will always take linear time.

I agree with other posters that the size of your graph should not be a problem. 250,000 is not very large!

Regarding concurrent execution: the graph is updated by all threads, so it needs to be a synchronised data structure. Since this is Python, you can make use of the Queue module to store new links still to be processed by your threads.
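
A sketch of that setup -- the module is named Queue in Python 2 and queue in Python 3, and fetch_friends is an assumed helper:

import queue       # `Queue` in Python 2
import threading

links = queue.Queue()            # thread-safe FIFO of users still to process
visited = set()
visited_lock = threading.Lock()  # the visited set is shared, so guard it

def worker(fetch_friends):
    while True:
        user = links.get()
        for friend in fetch_friends(user):
            with visited_lock:
                if friend in visited:
                    continue
                visited.add(friend)
            links.put(friend)    # hand the new link to the pool of workers
        links.task_done()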

澉约 2024-08-09 06:01:31

Although you say that getting a friend list takes a lot of time (10 seconds or more), a variant of good-old Dijkstra's algorithm just might work:

  1. Get any node.
  2. Get a connection from any node you already loaded.
  3. If the other end hasn't been loaded yet, add the node to the graph.
  4. Go to step 2.

The trick is to select the connection you load in step 2 in a smart way. A few short remarks about this:

  • You should somehow prevent the same connection from being loaded twice or more. Selecting a random connection and discarding it if it has already been loaded is very inefficient if you're after all connections.
  • If you want to load all connections eventually, load all connections of a node at the same time.

To really say something about efficiency, please provide more details about the data structure.
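
Following the second remark (fetch all of a node's connections at once), the four steps above collapse into a simple frontier expansion; a sketch, with get_connections assumed:

def expand_graph(start, get_connections):
    loaded = set()        # nodes whose connection lists were already fetched
    frontier = [start]    # step 1: take any node
    while frontier:
        node = frontier.pop(0)
        if node in loaded:
            continue      # never load the same node's connections twice
        loaded.add(node)
        for other in get_connections(node):   # step 2: all connections at once
            if other not in loaded:
                frontier.append(other)        # step 3: new nodes join the graph
    return loaded         # step 4 is the loop itself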
