Multithreading a recursive function to run one task at a time
I am writing a program to crawl websites. The crawl function is recursive and can take a long time to complete, so I used multithreading to crawl multiple websites.
What I actually need is for the crawler to finish one website and then move on to the next one (which should be waiting in a queue), instead of crawling multiple websites at the same time.
I am using C# and ASP.NET.
Comments (4)
The standard practice for this is to use a blocking queue. If you are using .NET 4.0, you can take advantage of the BlockingCollection class; otherwise you can use Stephen Toub's implementation.
Spin up as many worker threads as you feel necessary and have each one loop indefinitely, dequeuing items as they appear in the queue. Your main thread enqueues the items. A blocking queue is designed to block on the dequeue operation until an item becomes available.
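A minimal sketch of that pattern, assuming .NET 4.0 or later and a single worker thread so that sites are crawled strictly one after another. The class name `BlockingQueueSketch` and the `CrawlSite` method are hypothetical placeholders, not part of the original post; `CrawlSite` stands in for the real recursive crawl.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class BlockingQueueSketch
{
    // Hypothetical stand-in for the real recursive crawl of one site.
    private static void CrawlSite(string url, List<string> log)
    {
        log.Add(url);
    }

    // Crawls the given sites one at a time on a single worker thread
    // and returns them in the order they were processed.
    public static List<string> Run(IEnumerable<string> sites)
    {
        var processed = new List<string>();
        using (var queue = new BlockingCollection<string>())
        {
            var worker = new Thread(() =>
            {
                // Blocks while the queue is empty; the loop ends after
                // CompleteAdding() once the remaining items are drained.
                foreach (var url in queue.GetConsumingEnumerable())
                {
                    CrawlSite(url, processed);
                }
            });
            worker.Start();

            // The main thread acts as the producer.
            foreach (var site in sites)
            {
                queue.Add(site);
            }
            queue.CompleteAdding();
            worker.Join();
        }
        return processed;
    }
}
```

Starting more than one worker thread over the same queue would crawl several sites in parallel; keeping a single worker is what gives the one-at-a-time behavior asked for in the question.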
I don't usually think positive thoughts when it comes to web crawlers...
You want to use a thread pool.
You simply 'push' your workload into the queue and let the thread pool manage it.
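A rough sketch of the thread pool approach, using `ThreadPool.QueueUserWorkItem` with a `CountdownEvent` to wait for completion. The class name `ThreadPoolSketch` is hypothetical and the crawl itself is reduced to a counter. Note the assumption here: the pool decides how many items run at once, so this does not by itself guarantee one site at a time.

```csharp
using System;
using System.Threading;

public static class ThreadPoolSketch
{
    // Queues one work item per site, waits for all of them to finish,
    // and returns how many were processed.
    public static int Run(string[] sites)
    {
        int crawled = 0;
        using (var done = new CountdownEvent(sites.Length))
        {
            foreach (var site in sites)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    // Placeholder: the real crawl of (string)state would go here.
                    Interlocked.Increment(ref crawled);
                    done.Signal();
                }, site);
            }
            // Blocks until every queued item has signalled.
            done.Wait();
        }
        return crawled;
    }
}
```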
I have to say, I'm not a threading expert and my C# is quite rusty, but considering the requirements I would suggest something like this:
Define a pool of Crawler threads.
The Crawler behavior should be something like this: the crawler takes a thread from the pool and starts it running in the background with the address of the child node.
I think there are some challenges here, but as a general flow I believe it can do the job.
Put all your URLs in a queue, and pop one off the queue each time you are done with the previous one.
You could also put the recursively discovered links in the queue, to better control how many downloads you are executing at a time.
You could set up X worker threads that each take a URL off the queue in order to process more at a time. This way you can throttle it yourself.
You can use ConcurrentQueue<T> in .NET to get a thread-safe queue to work with.
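One way this could look, as a sketch rather than a finished crawler: discovered links go back into a `ConcurrentQueue<string>` instead of being followed recursively, and a fixed number of worker threads drain it. The class name `ThrottledCrawlSketch`, the `getLinks` callback (which would download a page and return the URLs found on it), and the `pending` counter used to detect completion are all assumptions made for illustration. `Volatile.Read` requires .NET 4.5 or later.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class ThrottledCrawlSketch
{
    // getLinks is a hypothetical callback: fetch one page, return its links.
    // Returns every URL that was visited.
    public static ICollection<string> Run(
        IEnumerable<string> seeds,
        Func<string, IEnumerable<string>> getLinks,
        int workerCount)
    {
        var queue = new ConcurrentQueue<string>();
        var seen = new ConcurrentDictionary<string, bool>();
        int pending = 0; // URLs that are queued or currently being processed

        foreach (var seed in seeds)
        {
            if (seen.TryAdd(seed, true))
            {
                Interlocked.Increment(ref pending);
                queue.Enqueue(seed);
            }
        }

        var workers = new List<Thread>();
        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                while (Volatile.Read(ref pending) > 0)
                {
                    string url;
                    if (!queue.TryDequeue(out url))
                    {
                        // Queue is momentarily empty, but another worker
                        // may still be about to add links.
                        Thread.Sleep(10);
                        continue;
                    }
                    try
                    {
                        foreach (var link in getLinks(url))
                        {
                            // Only enqueue URLs that have not been seen before.
                            if (seen.TryAdd(link, true))
                            {
                                Interlocked.Increment(ref pending);
                                queue.Enqueue(link);
                            }
                        }
                    }
                    finally
                    {
                        // This URL is finished, whether or not it failed.
                        Interlocked.Decrement(ref pending);
                    }
                }
            });
            workers.Add(worker);
            worker.Start();
        }

        foreach (var thread in workers)
        {
            thread.Join();
        }
        return seen.Keys;
    }
}
```

Passing `workerCount: 1` processes one URL at a time; raising it is the throttle described above.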