Multithreading a recursive function to execute one task at a time

Posted 2024-09-05 17:38:33

I am writing a program to crawl websites. The crawl function is recursive and may take a long time to complete, so I used multithreading to crawl multiple websites.
What I actually need is that after it finishes crawling one website, it moves on to the next one (which should be in a queue) instead of crawling multiple websites at a time.
I am using C# and ASP.NET.

Comments (4)

夜灵血窟げ 2024-09-12 17:38:33

The standard practice for doing this is to use a blocking queue. If you are using .NET 4.0 then you can take advantage of the BlockingCollection class; otherwise you can use Stephen Toub's implementation.

What you will do is spin up as many worker threads as you feel necessary and have them go around in an infinite loop, dequeueing items as they appear in the queue. Your main thread will enqueue the items. A blocking queue is designed to wait/block on the dequeue operation until an item becomes available.

using System.Threading;

// BlockingQueue refers to Stephen Toub's implementation mentioned above;
// on .NET 4.0, BlockingCollection<string> can play the same role.
public class Program
{
  private static BlockingQueue<string> m_Queue = new BlockingQueue<string>();

  public static void Main()
  {
    // Spin up as many worker threads as necessary.
    var thread1 = new Thread(Process);
    var thread2 = new Thread(Process);
    thread1.Start();
    thread2.Start();
    while (true)
    {
      string url = GetNextUrl(); // however the next site is discovered
      m_Queue.Enqueue(url);
    }
  }

  public static void Process()
  {
    while (true)
    {
      // Blocks until an item becomes available.
      string url = m_Queue.Dequeue();
      // Do whatever with the url here.
    }
  }
}
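
On .NET 4.0 the same pattern can be written against the built-in BlockingCollection<string>. This is a minimal sketch of that variant rather than code from the answer; GetNextUrl is stubbed out as a hypothetical placeholder.

using System.Collections.Concurrent;
using System.Threading;

public class Program
{
  private static BlockingCollection<string> m_Queue = new BlockingCollection<string>();

  public static void Main()
  {
    var thread1 = new Thread(Process);
    var thread2 = new Thread(Process);
    thread1.Start();
    thread2.Start();
    while (true)
    {
      m_Queue.Add(GetNextUrl()); // Add replaces Enqueue
    }
  }

  public static void Process()
  {
    // GetConsumingEnumerable blocks until an item becomes available.
    foreach (string url in m_Queue.GetConsumingEnumerable())
    {
      // Do whatever with the url here.
    }
  }

  private static string GetNextUrl()
  {
    return "http://example.com"; // hypothetical stand-in
  }
}
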
罪#恶を代价 2024-09-12 17:38:33

I don't usually think positive thoughts when it comes to web crawlers...

You want to use a thread pool.

 ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);

You simply 'push' your workload into the queue, and let the thread pool manage it.
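
As a minimal sketch of how that call fits into a program (CrawlSite and the seed URLs here are hypothetical names, not from the answer):

using System;
using System.Threading;

public class Crawler
{
  public static void Main()
  {
    // Queue one work item per site; the thread pool decides when each runs.
    foreach (string s in new[] { "http://example.com", "http://example.org" })
    {
      ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);
    }
    Console.ReadLine(); // keep the process alive while pool threads work
  }

  // WaitCallback requires this signature: a single object parameter.
  private static void CrawlSite(object state)
  {
    string url = (string)state;
    // Crawl the site here.
  }
}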

温柔少女心 2024-09-12 17:38:33

I have to say - I'm not a threading expert and my C# is quite rusty - but considering the requirements I would suggest something like this:

  1. Define a Queue for the websites.
  2. Define a Pool with Crawler threads.
  3. The main process iterates over the website queue and retrieves the site address.
  4. Retrieve an available thread from the pool - assign it the website address and allow it to start running. Set an indicator in the thread object that it should wait for all subsequent threads to finish (so you will not continue to the next site).
  5. Once all the threads have ended - the main thread (started in step #4) will end and return to the main loop of the main process to continue to the next website.

The Crawler behavior should be something like this:

  1. Investigate the content of the current address
  2. Retrieve the hierarchy below the current level
  3. For each child of the current node of the site tree - pull a new crawler thread from the pool and start it running in the background with the address of the child node
  4. If the pool is empty, wait until a thread becomes available.
  5. If the thread is marked to wait - wait for all the other threads to finish.

I think there are some challenges here - but as a general flow I believe it can do the job.
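
A rough sketch of that flow, with the caveat that all names here (GetChildLinks, the pool size, the seed sites) are placeholders rather than anything from the answer:

using System;
using System.Collections.Generic;
using System.Threading;

public class SequentialCrawler
{
  // Step 2: a "pool" of crawler slots, modelled here with a semaphore.
  private static readonly Semaphore Pool = new Semaphore(4, 4);

  public static void Main()
  {
    // Step 1: a queue of websites (placeholder addresses).
    var sites = new Queue<string>(new[] { "http://example.com", "http://example.org" });

    // Steps 3-5: crawl one site completely before dequeuing the next.
    while (sites.Count > 0)
    {
      Crawl(sites.Dequeue());
    }
  }

  private static void Crawl(string url)
  {
    Pool.WaitOne(); // step 4 of the crawler behavior: wait for a free slot
    List<string> children;
    try
    {
      children = GetChildLinks(url); // placeholder for fetching and parsing
    }
    finally
    {
      Pool.Release();
    }

    // Step 3: one background thread per child node...
    var threads = new List<Thread>();
    foreach (string child in children)
    {
      var t = new Thread(o => Crawl((string)o));
      threads.Add(t);
      t.Start(child);
    }
    // Step 5: ...and wait for all of them before returning.
    foreach (Thread t in threads)
    {
      t.Join();
    }
  }

  private static List<string> GetChildLinks(string url)
  {
    // Hypothetical stand-in for downloading the page and extracting links.
    return new List<string>();
  }
}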

川水往事 2024-09-12 17:38:33

Put all your URLs in a queue, and pop one off the queue each time you are done with the previous one.

You could also put the recursive links in the queue, to better control how many downloads you are executing at a time.

You could set up X worker threads which all take a URL off the queue in order to process more at a time, but this way you can throttle it yourself.

You can use ConcurrentQueue<T> in .NET to get a thread-safe queue to work with.
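
A minimal sketch of that approach (the seed URL and ExtractLinks are made-up placeholders); with workerCount set to 1 it crawls strictly one page at a time, and raising it processes more at once:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class QueueCrawler
{
  // Thread-safe queue holding the URLs still to be crawled.
  private static readonly ConcurrentQueue<string> Urls = new ConcurrentQueue<string>();

  public static void Main()
  {
    Urls.Enqueue("http://example.com"); // placeholder seed URL

    const int workerCount = 1;
    var workers = new List<Thread>();
    for (int i = 0; i < workerCount; i++)
    {
      var t = new Thread(Work);
      workers.Add(t);
      t.Start();
    }
    foreach (Thread t in workers)
    {
      t.Join();
    }
  }

  private static void Work()
  {
    string url;
    // Pop one URL off the queue each time the previous one is done;
    // recursive links go back into the queue instead of spawning threads.
    while (Urls.TryDequeue(out url))
    {
      foreach (string link in ExtractLinks(url))
      {
        Urls.Enqueue(link);
      }
    }
  }

  private static IEnumerable<string> ExtractLinks(string url)
  {
    // Hypothetical stand-in for downloading and parsing the page.
    yield break;
  }
}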
