Fastest way to scrape all of the pages within a website
I have a C# app that needs to scrape many pages within a certain domain as fast as possible. I have a Parallel.ForEach loop that iterates over all of the URLs (multi-threaded) and scrapes each one using the code below:
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    HttpWebRequest request = null;
    HttpWebResponse response = null;
    Stream responseStream = null;
    StreamReader reader = null;
    string html = null;
    try
    {
        //create request (which supports http compression)
        request = (HttpWebRequest)WebRequest.Create(url);
        request.Pipelined = true;
        request.KeepAlive = true;
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
        if (updateDate != null)
            request.IfModifiedSince = updateDate.Value;

        //get response.
        response = (HttpWebResponse)request.GetResponse();
        responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        //read html.
        reader = new StreamReader(responseStream, Encoding.Default);
        html = reader.ReadToEnd();
    }
    catch
    {
        throw;
    }
    finally
    {
        //dispose of objects.
        request = null;
        if (response != null)
        {
            response.Close();
            response = null;
        }
        if (responseStream != null)
        {
            responseStream.Close();
            responseStream.Dispose();
        }
        if (reader != null)
        {
            reader.Close();
            reader.Dispose();
        }
    }
    return html;
}
As you can see, I have HTTP compression support and have set request.KeepAlive and request.Pipelined to true. I'm wondering whether the code I'm using is the fastest way to scrape many web pages within the same site, or whether there's a better way that will keep the session open across multiple requests. My code creates a new request instance for every page it hits; should I instead try to use a single request instance for all of the pages? And is it ideal to have pipelining and keep-alive enabled?
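For reference, the Parallel.ForEach driver loop described above might look roughly like the sketch below, living in the same class as ScrapeWebpage. The urls list, the MaxDegreeOfParallelism value, and the result collection are illustrative assumptions, not part of the original code:

// Sketch of the multi-threaded driver loop the question describes.
// Requires System.Collections.Generic, System.Collections.Concurrent,
// and System.Threading.Tasks. All specific values here are assumptions.
var urls = new List<string> { "http://example.com/page1", "http://example.com/page2" };
var results = new ConcurrentDictionary<string, string>();
Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = 8 }, url =>
{
    // Passing null means no If-Modified-Since header is sent.
    results[url] = ScrapeWebpage(url, null);
});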
Comments (1)
It turns out what I was missing was this:
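The snippet the answer refers to is not included in this copy. Given the question (many concurrent HttpWebRequest calls against a single host), a very common culprit is System.Net's per-host connection limit, which defaults to two concurrent connections and silently serializes everything beyond that regardless of how many threads Parallel.ForEach uses. A minimal sketch, assuming that is what the missing snippet changed (the value 100 is illustrative):

// Assumption: the missing snippet raised the per-host connection limit.
// HttpWebRequest is capped at 2 concurrent connections per host by default.
// Set this once at application startup, before any requests are issued.
ServicePointManager.DefaultConnectionLimit = 100;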