HttpWebResponse + StreamReader is very slow

Posted 2024-07-22 12:10:21

I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I also tried using StreamReader.Read() in a loop to build my HTML string.

I'm only downloading pages that are about 5-10 KB.

It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!

All the sites should be very fast, as they are very close to my location and have fast servers (downloading them in Internet Explorer takes practically no time), and I am not using any proxy.

My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?

How do I reduce the StreamReader.ReadToEnd() time DRASTICALLY?
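
For reference, a minimal sketch of the kind of fetch code being described; the names url and pageContent are illustrative, not the asker's actual code:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // ReadToEnd() is where the reported ~5 seconds is being spent.
    string pageContent = reader.ReadToEnd();
}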

Comments (9)

ゝ偶尔ゞ 2024-07-29 12:10:21

HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:

<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:

// 'stream' is the response stream, e.g. from response.GetResponseStream()
using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    pageContent = reader.ReadToEnd();
  }
}

薄情伤 2024-07-29 12:10:21

WebClient's DownloadString is a simple wrapper around HttpWebRequest; could you try that temporarily and see if the speed improves? If things get much faster, could you share your code so we can take a look at what may be wrong with it?
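
A minimal sketch of that test, assuming url holds the page address:

using (var client = new System.Net.WebClient())
{
    string html = client.DownloadString(url);
}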

EDIT:

It seems HttpWebRequest observes IE's 'max concurrent connections' setting. Are these URLs on the same domain? You could try increasing the connection limit to see if that helps. I found this article about the problem:

By default, you can't perform more than 2-3 async HttpWebRequests (depending on the OS). To override it (the easiest way, IMHO), don't forget to add the following under the <configuration> section of the application's config file:

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>
随梦而飞# 2024-07-29 12:10:21

I had the same problem, but setting the HttpWebRequest's Proxy property to null solved it.

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(ub.Uri);
// A null Proxy skips automatic proxy detection, which is often the delay.
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

浅听莫相离 2024-07-29 12:10:21

Have you tried raising ServicePointManager.DefaultConnectionLimit? I usually set it to 200 for things similar to this.
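
A minimal sketch; set this once at startup, before any requests are issued:

// The default is 2 connections per host for client apps under the
// classic HttpWebRequest stack; a crawler needs far more.
System.Net.ServicePointManager.DefaultConnectionLimit = 200;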

羁客 2024-07-29 12:10:21

I had the same problem, but worse: response = (HttpWebResponse)webRequest.GetResponse(); in my code delayed about 10 seconds before any further code ran, and after that the download saturated my connection.

Kurt's answer (defaultProxy enabled="false") solved the problem. Now the response is almost instant and I can download any HTTP file at my connection's maximum speed. :)

时光无声 2024-07-29 12:10:21

I found the application config method did not work for me, but the problem was still due to the proxy settings. My simple request used to take up to 30 seconds; now it takes 1:

public string GetWebData()
{
    string destAddr = "http://mydestination.com";
    System.Net.WebClient myWebClient = new System.Net.WebClient();
    // An empty WebProxy means "no proxy": the request goes direct,
    // skipping automatic proxy detection entirely.
    myWebClient.Proxy = new System.Net.WebProxy();
    return myWebClient.DownloadString(destAddr);
}

美煞众生 2024-07-29 12:10:21

Thank you all for your answers; they helped me dig in the proper direction. I faced the same performance issue, but the proposed solution of changing the application config file (as I understand it, that solution is for web applications) didn't fit my needs. My solution is shown below:

HttpWebRequest webRequest = (HttpWebRequest)System.Net.WebRequest.Create(fullUrl);
webRequest.Method = WebRequestMethods.Http.Post;

if (useDefaultProxy)
{
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
    webRequest.Credentials = CredentialCache.DefaultCredentials;
}
else
{
    // Clearing the process-wide default proxy disables automatic proxy
    // detection; webRequest.Proxy then ends up null (a direct connection).
    System.Net.WebRequest.DefaultWebProxy = null;
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
}

往日情怀 2024-07-29 12:10:21

Why wouldn't multithreading solve this issue? Multithreading would minimize the network wait times, and since you'd be storing the contents of the buffer in system memory (RAM), there would be no I/O bottleneck from dealing with a filesystem. Thus, your 82 pages that take 82 seconds to download and parse should take something like 15 seconds (assuming a 4-core processor). Correct me if I'm missing something; a sketch follows the diagram below.

____ DOWNLOAD THREAD ____
  Download Contents
  Form Stream
  Read Contents
_________________________
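
A minimal sketch of that pipeline, assuming urls holds the page addresses; the names and the degree of parallelism are illustrative:

// Requires using System.Threading.Tasks; overlap network waits by
// fetching pages on worker threads. Pages are parsed from memory,
// so there is no filesystem I/O.
Parallel.ForEach(urls, new ParallelOptions { MaxDegreeOfParallelism = 20 }, url =>
{
    using (var client = new System.Net.WebClient())
    {
        client.Proxy = null; // skip proxy auto-detection (see earlier answers)
        string html = client.DownloadString(url);
        // ... parse html here ...
    }
});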

恋你朝朝暮暮 2024-07-29 12:10:21

Try adding the cookie AspxAutoDetectCookieSupport=1 to your request, like this:

request.CookieContainer = new CookieContainer();
// 'target' is assumed to be the request Uri. Pre-setting this cookie avoids
// ASP.NET's cookieless-session auto-detect redirect on sites configured with
// cookieless="AutoDetect", which can slow the first response.
request.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });