Downloading an undefined number of files with HttpWebRequest.BeginGetResponse
I have to write a small app which downloads a few thousand files. Some of these files contain references to other files that must be downloaded as part of the same process. The following code downloads the initial list of files, but I would like to download the other files as part of the same loop. What happens here is that the loop completes before the first request comes back. Any idea how to achieve this?
var countdownLatch = new CountdownEvent(Urls.Count);
string url;
while (Urls.TryDequeue(out url))
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.BeginGetResponse(
        new AsyncCallback(ar =>
        {
            using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
            {
                using (var sr = new StreamReader(response.GetResponseStream()))
                {
                    string myFile = sr.ReadToEnd();
                    // TODO: Look for a reference to another file. If found, queue a new Url.
                }
            }
            countdownLatch.Signal();   // one count per completed initial request
        }), webRequest);
}
countdownLatch.Wait();
2 Answers
One solution which comes to mind is to keep track of the number of pending requests, and to finish the loop only once no requests are pending and the URL queue is empty.
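A minimal sketch of that idea, assuming the ConcurrentQueue<string> named Urls from the question and a hypothetical seed URL: an Interlocked counter tracks in-flight requests, every callback re-drains the queue before decrementing the counter, and a ManualResetEvent releases the main thread once the counter reaches zero and the queue is empty.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class Downloader
{
    static readonly ConcurrentQueue<string> Urls = new ConcurrentQueue<string>();
    static int pending;                                           // requests in flight
    static readonly ManualResetEvent done = new ManualResetEvent(false);

    static void Main()
    {
        Urls.Enqueue("http://example.com/list.txt");              // hypothetical seed URL
        DrainQueue();
        done.WaitOne();
    }

    // Start a request for every queued URL; called again from each callback
    // so that newly discovered URLs are downloaded as part of the same run.
    static void DrainQueue()
    {
        string url;
        while (Urls.TryDequeue(out url))
        {
            Interlocked.Increment(ref pending);
            var webRequest = (HttpWebRequest)WebRequest.Create(url);
            webRequest.BeginGetResponse(OnResponse, webRequest);
        }
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        using (var response = (HttpWebResponse)request.EndGetResponse(ar))
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            string myFile = sr.ReadToEnd();
            // TODO: Look for a reference to another file. If found, Urls.Enqueue(...) it.
        }

        DrainQueue();                          // start requests for anything queued above

        // Draining before decrementing guarantees that pending only reaches zero
        // after every discovered URL has had a request started for it.
        if (Interlocked.Decrement(ref pending) == 0 && Urls.IsEmpty)
            done.Set();
    }
}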
You are trying to write a web crawler. In order to write a good web crawler, you first need to define some parameters...
1) How many requests do you want to download simultaneously? In other words, how much throughput do you want? This will determine things like how many requests you want outstanding, what the thread-pool size should be, etc.
2) You will have to have a queue of URLs. This queue is populated by each request that completes. You now need to decide what the growth strategy of the queue is. For example, you cannot have an unbounded queue, as you can pump work items into the queue faster than you can download them from the network.
Given this, you can design a system as follows:
Have at most N worker threads that actually download from the web. They take one item from the queue at a time and download the data. They parse the data and populate your URL queue.
If there are more than 'M' URLs in the queue, then the queue blocks and does not allow any more URLs to be queued. Now, here you can do one of two things. You can either cause the thread that is enqueuing to block, or you can just discard the work item being enqueued. Once another work item completes on another thread and a URL is dequeued, the blocked thread will be able to enqueue successfully.
With a system like this, you can ensure that you will not run out of system resources while downloading the data.
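As a rough illustration of that design (not code from the answer), here is a sketch using a BlockingCollection<string> bounded to M entries, so Add() blocks once the queue is full, with N plain worker threads. N, M, and the seed URL are assumed values, and termination handling is deliberately omitted.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class BoundedCrawler
{
    const int N = 4;       // concurrent download threads (assumed value)
    const int M = 1000;    // queue capacity; Add() blocks beyond this (assumed value)

    // Bounded queue: enqueuing blocks once M URLs are waiting.
    static readonly BlockingCollection<string> Urls = new BlockingCollection<string>(M);

    static void Main()
    {
        Urls.Add("http://example.com/start.txt");    // hypothetical seed URL

        var workers = new Thread[N];
        for (int i = 0; i < N; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }
        foreach (var w in workers) w.Join();         // real code needs a termination signal
    }

    static void Worker()
    {
        // Blocks until a URL is available; the loop ends after CompleteAdding()
        // is called, which a full crawler would do once no work remains (omitted here).
        foreach (string url in Urls.GetConsumingEnumerable())
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var sr = new StreamReader(response.GetResponseStream()))
            {
                string data = sr.ReadToEnd();
                // Parse data and Urls.Add(...) any referenced URLs;
                // Add() blocks while the queue already holds M items.
            }
        }
    }
}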
Implementation:
Note that if you are using async, then you are using an extra I/O thread to do the download. This is fine, as long as you are mindful of that fact. You can do a pure async implementation, where you can have 'N' BeginGetResponse() calls outstanding, and for each one that completes, you start another one. That way you will always have 'N' requests outstanding.
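A sketch of that pure-async pattern, under the same assumptions as above (hypothetical seed URL, assumed N): a pump starts requests until roughly N are outstanding, and each completion calls the pump again so the level of concurrency is maintained. The check-then-increment in Pump() can briefly overshoot N; a production version would tighten that.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Net;
using System.Threading;

class AsyncCrawler
{
    const int N = 10;    // target number of outstanding requests (assumed value)
    static readonly ConcurrentQueue<string> Urls = new ConcurrentQueue<string>();
    static int outstanding;

    static void Main()
    {
        Urls.Enqueue("http://example.com/start.txt");    // hypothetical seed URL
        Pump();
        Console.ReadLine();    // real code would wait on a completion signal instead
    }

    // Start requests until about N are outstanding or the queue is empty.
    static void Pump()
    {
        string url;
        while (Volatile.Read(ref outstanding) < N && Urls.TryDequeue(out url))
        {
            Interlocked.Increment(ref outstanding);
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.BeginGetResponse(OnResponse, request);
        }
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        using (var response = (HttpWebResponse)request.EndGetResponse(ar))
        using (var sr = new StreamReader(response.GetResponseStream()))
        {
            string data = sr.ReadToEnd();
            // Parse data and Urls.Enqueue(...) any referenced URLs here.
        }
        Interlocked.Decrement(ref outstanding);
        Pump();    // each completion tops the pool back up toward N
    }
}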