Multithreading a recursive function to execute one task at a time

Posted 2024-09-05 17:38:33

I am writing a program to crawl websites. The crawl function is recursive and may take a long time to complete, so I used multithreading to crawl multiple websites.
What I actually need is that after it finishes crawling one website, it moves on to the next one (which should be in a queue) instead of crawling multiple websites at a time.
I am using C# and ASP.NET.

Comments (4)

夜灵血窟げ 2024-09-12 17:38:33

The standard practice for doing this is to use a blocking queue. If you are using .NET 4.0 then you can take advantage of the BlockingCollection class; otherwise you can use Stephen Toub's implementation.

What you will do is spin up as many worker threads as you feel necessary and have them go around in an infinite loop, dequeueing items as they appear in the queue. Your main thread will enqueue the items. A blocking queue is designed to wait/block on the dequeue operation until an item becomes available.

using System.Threading;

// BlockingQueue refers to Stephen Toub's implementation mentioned above;
// on .NET 4.0, BlockingCollection<string> can play the same role.
public class Program
{
  private static BlockingQueue<string> m_Queue = new BlockingQueue<string>();

  public static void Main()
  {
    // Spin up as many worker threads as necessary.
    var thread1 = new Thread(Process);
    var thread2 = new Thread(Process);
    thread1.Start();
    thread2.Start();
    while (true)
    {
      string url = GetNextUrl(); // however the next site is discovered
      m_Queue.Enqueue(url);
    }
  }

  public static void Process()
  {
    while (true)
    {
      // Blocks until an item becomes available.
      string url = m_Queue.Dequeue();
      // Do whatever with the url here.
    }
  }
}
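
On .NET 4.0 the same pattern can be written against the built-in BlockingCollection<string>. This is a minimal sketch of that variant rather than code from the answer; GetNextUrl is stubbed out as a hypothetical placeholder.

using System.Collections.Concurrent;
using System.Threading;

public class Program
{
  private static BlockingCollection<string> m_Queue = new BlockingCollection<string>();

  public static void Main()
  {
    var thread1 = new Thread(Process);
    var thread2 = new Thread(Process);
    thread1.Start();
    thread2.Start();
    while (true)
    {
      m_Queue.Add(GetNextUrl()); // Add replaces Enqueue
    }
  }

  public static void Process()
  {
    // GetConsumingEnumerable blocks until an item becomes available.
    foreach (string url in m_Queue.GetConsumingEnumerable())
    {
      // Do whatever with the url here.
    }
  }

  private static string GetNextUrl()
  {
    return "http://example.com"; // hypothetical stand-in
  }
}
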
罪#恶を代价 2024-09-12 17:38:33

I don't usually think positive thoughts when it comes to web crawlers...

You want to use a thread pool.

 ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);

You simply 'push' your workload into the queue, and let the thread pool manage it.
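
As a minimal sketch of how that call fits into a program (CrawlSite and the seed URLs here are hypothetical names, not from the answer):

using System;
using System.Threading;

public class Crawler
{
  public static void Main()
  {
    // Queue one work item per site; the thread pool decides when each runs.
    foreach (string s in new[] { "http://example.com", "http://example.org" })
    {
      ThreadPool.QueueUserWorkItem(new WaitCallback(CrawlSite), (object)s);
    }
    Console.ReadLine(); // keep the process alive while pool threads work
  }

  // WaitCallback requires this signature: a single object parameter.
  private static void CrawlSite(object state)
  {
    string url = (string)state;
    // Crawl the site here.
  }
}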

温柔少女心 2024-09-12 17:38:33

I have to say - I'm not a threading expert and my C# is quite rusty - but considering the requirements I would suggest something like this:

  1. Define a Queue for the websites.
  2. Define a Pool with Crawler threads.
  3. The main process iterates over the website queue and retrieves the site address.
  4. Retrieve an available thread from the pool - assign it the website address and allow it to start running. Set an indicator in the thread object that it should wait for all subsequent threads to finish (so you will not continue to the next site).
  5. Once all the threads have ended - the main thread (started in step #4) will end and return to the main loop of the main process to continue to the next website.

The Crawler behavior should be something like this:

  1. Investigate the content of the current address
  2. Retrieve the hierarchy below the current level
  3. For each child of the current node of the site tree - pull a new crawler thread from the pool and start it running in the background with the address of the child node
  4. If the pool is empty, wait until a thread becomes available.
  5. If the thread is marked to wait - wait for all the other threads to finish.

I think there are some challenges here - but as a general flow I believe it can do the job.
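
A rough sketch of that flow, with the caveat that all names here (GetChildLinks, the pool size, the seed sites) are placeholders rather than anything from the answer:

using System;
using System.Collections.Generic;
using System.Threading;

public class SequentialCrawler
{
  // Step 2: a "pool" of crawler slots, modelled here with a semaphore.
  private static readonly Semaphore Pool = new Semaphore(4, 4);

  public static void Main()
  {
    // Step 1: a queue of websites (placeholder addresses).
    var sites = new Queue<string>(new[] { "http://example.com", "http://example.org" });

    // Steps 3-5: crawl one site completely before dequeuing the next.
    while (sites.Count > 0)
    {
      Crawl(sites.Dequeue());
    }
  }

  private static void Crawl(string url)
  {
    Pool.WaitOne(); // step 4 of the crawler behavior: wait for a free slot
    List<string> children;
    try
    {
      children = GetChildLinks(url); // placeholder for fetching and parsing
    }
    finally
    {
      Pool.Release();
    }

    // Step 3: one background thread per child node...
    var threads = new List<Thread>();
    foreach (string child in children)
    {
      var t = new Thread(o => Crawl((string)o));
      threads.Add(t);
      t.Start(child);
    }
    // Step 5: ...and wait for all of them before returning.
    foreach (Thread t in threads)
    {
      t.Join();
    }
  }

  private static List<string> GetChildLinks(string url)
  {
    // Hypothetical stand-in for downloading the page and extracting links.
    return new List<string>();
  }
}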

川水往事 2024-09-12 17:38:33

Put all your URLs in a queue, and pop one off the queue each time you are done with the previous one.

You could also put the recursive links in the queue, to better control how many downloads you are executing at a time.

You could set up X worker threads which all take a URL off the queue in order to process more at a time, but this way you can throttle it yourself.

You can use ConcurrentQueue<T> in .NET to get a thread-safe queue to work with.
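
A minimal sketch of that approach (the seed URL and ExtractLinks are made-up placeholders); with workerCount set to 1 it crawls strictly one page at a time, and raising it processes more at once:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class QueueCrawler
{
  // Thread-safe queue holding the URLs still to be crawled.
  private static readonly ConcurrentQueue<string> Urls = new ConcurrentQueue<string>();

  public static void Main()
  {
    Urls.Enqueue("http://example.com"); // placeholder seed URL

    const int workerCount = 1;
    var workers = new List<Thread>();
    for (int i = 0; i < workerCount; i++)
    {
      var t = new Thread(Work);
      workers.Add(t);
      t.Start();
    }
    foreach (Thread t in workers)
    {
      t.Join();
    }
  }

  private static void Work()
  {
    string url;
    // Pop one URL off the queue each time the previous one is done;
    // recursive links go back into the queue instead of spawning threads.
    while (Urls.TryDequeue(out url))
    {
      foreach (string link in ExtractLinks(url))
      {
        Urls.Enqueue(link);
      }
    }
  }

  private static IEnumerable<string> ExtractLinks(string url)
  {
    // Hypothetical stand-in for downloading and parsing the page.
    yield break;
  }
}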
