Good graph traversal algorithm

Abstract problem: I have a graph of about 250,000 nodes, and the average connectivity is around 10. Finding a node's connections is a long process (say, 10 seconds). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the DB. Allowing concurrency, but not having more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest?

Concrete problem: I'm trying to scrape a website's user pages. To discover new users, I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.

My current implementation:

import datetime
import random
import sys
import time

# ThreadPool, models, and alias_view are project-specific modules (not shown).

def run():
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool):
    id = user
    alias = models.Alias.get(id)

    # Skip the import if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now():
        sys.stderr.write("Skipping: %s\n" % user)
    else:
        sys.stderr.write("Importing: %s\n" % user)
        # Throttle: don't let the import queue grow past 20 jobs
        while import_pool.num_jobs() > 20:
            print("Too many queued jobs, sleeping")
            time.sleep(15)

        import_pool.add_job(alias_view.import_id, [id],
                            lambda rv: sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2:
        # Queue two randomly chosen friends for crawling; `friend` (not `user`)
        # avoids clobbering the variable captured by the lambda above
        for friend in random.sample(users, 2):
            if user_pool.num_jobs() < 100:
                user_pool.add_job(do_user, [friend, import_pool, user_pool])

def crawl(id, limit=50):
    '''returns the first 'limit' friends of a user'''
    *not relevant*

Problems with the current implementation:

  • Gets stuck in cliques that I've already imported, wasting time and leaving the import threads idle.
  • Will add more as they get pointed out.

So, marginal improvements are welcome, as well as full rewrites. Thanks!

Comments (4)

゛时过境迁 2024-08-09 06:01:31

To remember the IDs of the users you've already visited, you need a map of 250,000 integers. That's far from "too much". Just maintain such a map and only traverse edges that lead to not-yet-discovered users, adding them to the map as you find such edges.

As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.
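
A minimal sketch of that idea -- assuming a get_friends(user) helper that wraps the slow friend-list fetch (the helper name is made up here, not from the question):

import threading
from collections import deque

def bfs_crawl(seed, get_friends):
    visited = {seed}          # 250,000 IDs fit comfortably in memory
    lock = threading.Lock()   # guards `visited` if several workers share it
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for friend in get_friends(user):   # the slow, ~10-second call
            with lock:
                if friend in visited:
                    continue               # edge leads to a known node: skip it
                visited.add(friend)
            queue.append(friend)           # newly discovered node, crawl it later
    return visited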

伏妖词 2024-08-09 06:01:31

I am really confused as to why it takes 10 seconds to add a node to the DB. That sounds like a problem. What database are you using? Do you have severe platform restrictions?

With modern systems, and their oodles of memory, I would recommend a nice simple cache of some kind. You should be able to create a very quick cache of user information that would allow you to avoid repeated work. When you have encountered a node already, stop processing. This will avoid cycling forever in cliques.

If you need to allow for rehashing existing nodes after a while, you can use a last_visit_number, which would be a global value in the DB. If the node has that number, then this crawl is the one that encountered it. If you want to automatically revisit any nodes, you just need to bump the last_visit_number before starting the crawl.
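
A rough sketch of that generation-number scheme (the field and helper names are hypothetical):

def start_crawl(db):
    # Bump the global counter once, before the crawl begins; every node
    # touched by this crawl gets stamped with the new value.
    db.last_visit_number += 1
    return db.last_visit_number

def should_process(node, crawl_number):
    if node.last_visit_number == crawl_number:
        return False     # this crawl already encountered the node
    node.last_visit_number = crawl_number
    return True          # first encounter in this crawl: process it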

By your description, I am not quite sure how you are getting stuck.

Edit ------
I just noticed you had a concrete question. In order to increase how quickly you pull in new data, I would keep track of the number of times a given user was linked to in your data (imported or not yet imported). When choosing a user to crawl, I would pick users that have a low number of links. I would specifically go for either the lowest number of links or a random choice among the users with the lowest number of links.
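
One possible sketch of that selection policy, using a running counter of inbound links (the names are illustrative, not from the question):

import random
from collections import Counter

link_counts = Counter()   # user -> times seen in friend lists so far

def record_friends(friends):
    link_counts.update(friends)    # call this for every fetched friend list

def pick_next_user(candidates):
    # Prefer the least-linked users: their neighborhoods are the least explored.
    fewest = min(link_counts[u] for u in candidates)
    least_linked = [u for u in candidates if link_counts[u] == fewest]
    return random.choice(least_linked)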

Jacob

爺獨霸怡葒院 2024-08-09 06:01:31

There is no particular algorithm that will help you optimise the construction of a graph from scratch. One way or another, you are going to have to visit each node at least once. Whether you do this depth first or breadth first is irrelevant from a speed perspective. Theran correctly points out in a comment below that breadth-first search, by exploring nearer nodes first, may give you a more useful graph immediately, before the whole graph is completed; this may or may not be a concern for you. He also notes that the neatest version of depth-first search is implemented using recursion, which could potentially be a problem for you. Note that recursion is not required, however; you can add incompletely explored nodes to a stack and process them linearly if you wish.
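
For instance, the recursion-free depth-first variant can be sketched with an explicit stack (get_friends is again an assumed wrapper around the slow fetch):

def dfs_crawl(seed, get_friends):
    visited = set()
    stack = [seed]        # explicit stack instead of recursion
    while stack:
        user = stack.pop()
        if user in visited:
            continue
        visited.add(user)
        stack.extend(get_friends(user))   # unexplored neighbours go on the stack
    return visited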

If you do a simple existence check for new nodes (O(1) if you use a hash for lookup), then cycles will not be a problem at all. Cycles are only a concern if you do not store the complete graph. You can optimise searches through the graph, but the construction step itself will always take linear time.

I agree with other posters that the size of your graph should not be a problem. 250,000 is not very large!

Regarding concurrent execution: the graph is updated by all threads, so it needs to be a synchronised data structure. Since this is Python, you can make use of the Queue module to store new links still to be processed by your threads.
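
A sketch of that setup -- the module is named Queue in Python 2 and queue in Python 3, and fetch_friends is an assumed helper:

import queue       # `Queue` in Python 2
import threading

links = queue.Queue()            # thread-safe FIFO of users still to process
visited = set()
visited_lock = threading.Lock()  # the visited set is shared, so guard it

def worker(fetch_friends):
    while True:
        user = links.get()
        for friend in fetch_friends(user):
            with visited_lock:
                if friend in visited:
                    continue
                visited.add(friend)
            links.put(friend)    # hand the new link to the pool of workers
        links.task_done()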

澉约 2024-08-09 06:01:31

Although you say that getting a friend list takes a lot of time (10 seconds or more), a variant of good-old Dijkstra's algorithm just might work:

  1. Get any node.
  2. Get a connection from any node you already loaded.
  3. If the other end hasn't been loaded yet, add the node to the graph.
  4. Go to step 2.

The trick is to select the connection you load in step 2 in a smart way. A few short remarks about this:

  • You should somehow prevent the same connection from being loaded twice or more. Selecting a random connection and discarding it if it has already been loaded is very inefficient if you're after all connections.
  • If you want to load all connections eventually, load all connections of a node at the same time.

To really say something about efficiency, please provide more details about the data structure.
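
Following the second remark (fetch all of a node's connections at once), the four steps above collapse into a simple frontier expansion; a sketch, with get_connections assumed:

def expand_graph(start, get_connections):
    loaded = set()        # nodes whose connection lists were already fetched
    frontier = [start]    # step 1: take any node
    while frontier:
        node = frontier.pop(0)
        if node in loaded:
            continue      # never load the same node's connections twice
        loaded.add(node)
        for other in get_connections(node):   # step 2: all connections at once
            if other not in loaded:
                frontier.append(other)        # step 3: new nodes join the graph
    return loaded         # step 4 is the loop itself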
