HTTPWebResponse + StreamReader 非常慢

发布于 2024-07-22 12:10:21 字数 445 浏览 8 评论 0原文

我正在尝试用 C# 实现一个有限的网络爬虫（仅适用于几百个网站）使用 HttpWebResponse.GetResponse() 和 Streamreader.ReadToEnd() ，还尝试使用 StreamReader.Read() 和循环来构建我的 HTML 字符串。

我只下载大约5-10K的页面。

速度非常慢！例如，平均 GetResponse() 时间约为半秒，而平均 StreamREader.ReadToEnd() 时间约为 5 秒！

所有站点都应该非常快，因为它们离我的位置非常近，并且拥有快速的服务器。（在资源管理器中几乎不需要 D/L）并且我没有使用任何代理。

我的爬虫有大约 20 个线程同时从同一站点读取数据。这会引起问题吗？

如何大幅减少 StreamReader.ReadToEnd 时间？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ゝ偶尔ゞ 2024-07-29 12:10:21

HttpWebRequest 可能需要一段时间才能检测您的代理设置。尝试将其添加到您的应用程序配置中：

<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

您还可能会发现通过缓冲读取以减少对底层操作系统套接字的调用次数可以带来轻微的性能提升：

using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    pageContent = reader.ReadToEnd();
  }
}

HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:

<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:

using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    pageContent = reader.ReadToEnd();
  }
}

回复收藏 0 原文

薄情伤 2024-07-29 12:10:21

WebClient 的 DownloadString 是 HttpWebRequest 的简单包装器，您可以尝试暂时使用它，看看速度是否有所提高？如果事情变得更快，您能否分享您的代码，以便我们看看它可能有什么问题？

编辑：

HttpWebRequest 似乎遵守 IE 的“最大并发连接数”设置，这些 URL 是否位于同一域中？您可以尝试增加连接数限制，看看是否有帮助？我发现这篇文章有关该问题：

默认情况下，您无法执行更多操作
比 2-3 个异步 HttpWebRequest （取决于
在操作系统上）。为了覆盖它
（最简单的方法，恕我直言）不要忘记
添加到下面
应用程序配置中的部分
文件：

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>

WebClient's DownloadString is a simple wrapper for HttpWebRequest, could you try using that temporarily and see if the speed improves? If things get much faster, could you share your code so we can have a look at what may be wrong with it?

EDIT:

It seems HttpWebRequest observes IE's 'max concurrent connections' setting, are these URLs on the same domain? You could try increasing the connections limit to see if that helps? I found this article about the problem:

By default, you can't perform more
than 2-3 async HttpWebRequest (depends
on the OS). In order to override it
(the easiest way, IMHO) don't forget
to add this under
section in the application's config
file:

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>

回复收藏 0 原文

随梦而飞# 2024-07-29 12:10:21

我也遇到了同样的问题，但是当我将 HttpWebRequest 的 Proxy 参数设置为 null 时，问题就解决了。

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( ub.Uri );
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

I had the same problem, but when I sat the HttpWebRequest's Proxy parameter to null, it solved the problem.

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( ub.Uri );
request.Proxy = null;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

回复收藏 0 原文

浅听莫相离 2024-07-29 12:10:21

您尝试过 ServicePointManager.maxConnections 吗？对于类似的事情，我通常将其设置为 200。

回复收藏 0 原文

羁客 2024-07-29 12:10:21

我遇到了同样的问题，但最糟糕。
响应 = (HttpWebResponse)webRequest.GetResponse(); 在我的代码中
在运行更多代码之前延迟了大约 10 秒，之后下载使我的连接饱和。

kurt 的回答 defaultProxyenabled="false"

解决了这个问题。现在响应几乎是立即的，我可以以我的连接最大速度下载任何 http 文件:)
抱歉英语不好

回复收藏 0 原文

时光无声 2024-07-29 12:10:21

我发现应用程序配置方法不起作用，但问题仍然是由于代理设置造成的。我的简单请求以前最多需要 30 秒，现在只需要 1 秒。

public string GetWebData()
{
            string DestAddr = "http://mydestination.com";
            System.Net.WebClient myWebClient = new System.Net.WebClient();
            WebProxy myProxy = new WebProxy();
            myProxy.IsBypassed(new Uri(DestAddr));
            myWebClient.Proxy = myProxy;
            return myWebClient.DownloadString(DestAddr);
}

I found the Application Config method did not work, but the problem was still due to the proxy settings. My simple request used to take up to 30 seconds, now it takes 1.

public string GetWebData()
{
            string DestAddr = "http://mydestination.com";
            System.Net.WebClient myWebClient = new System.Net.WebClient();
            WebProxy myProxy = new WebProxy();
            myProxy.IsBypassed(new Uri(DestAddr));
            myWebClient.Proxy = myProxy;
            return myWebClient.DownloadString(DestAddr);
}

回复收藏 0 原文

美煞众生 2024-07-29 12:10:21

谢谢大家的回答，他们帮助我找到了正确的方向。我遇到了同样的性能问题，尽管建议的更改应用程序配置文件的解决方案（据我了解该解决方案适用于 Web 应用程序）不符合我的需求，我的解决方案如下所示：

HttpWebRequest webRequest;

webRequest = (HttpWebRequest)System.Net.WebRequest.Create(fullUrl);
webRequest.Method = WebRequestMethods.Http.Post;

if (useDefaultProxy)
{
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
    webRequest.Credentials = CredentialCache.DefaultCredentials;
}
else
{
    System.Net.WebRequest.DefaultWebProxy = null;
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
}

Thank you all for answers, they've helped me to dig in proper direction. I've faced with the same performance issue, though proposed solution to change application config file (as I understood that solution is for web applications) doesn't fit my needs, my solution is shown below:

HttpWebRequest webRequest;

webRequest = (HttpWebRequest)System.Net.WebRequest.Create(fullUrl);
webRequest.Method = WebRequestMethods.Http.Post;

if (useDefaultProxy)
{
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
    webRequest.Credentials = CredentialCache.DefaultCredentials;
}
else
{
    System.Net.WebRequest.DefaultWebProxy = null;
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
}

回复收藏 0 原文

往日情怀 2024-07-29 12:10:21

为什么多线程不能解决这个问题？多线程可以最大限度地减少网络等待时间，并且由于您将缓冲区的内容存储在系统内存 (RAM) 中，因此处理文件系统不会出现 IO 瓶颈。因此，下载和解析需要 82 秒的 82 个页面，应该只需要 15 秒（假设使用 4x 处理器）。如果我遗漏了什么，请纠正我。

____ 下载线程_____*

下载内容

表单流

阅读内容

_________________________*

回复收藏 0 原文

恋你朝朝暮暮 2024-07-29 12:10:21

尝试将 cookie(AspxAutoDetectCookieSupport=1) 添加到您的请求中，如下所示

request.CookieContainer = new CookieContainer();         
request.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });

Try to add cookie(AspxAutoDetectCookieSupport=1) to your request like this

request.CookieContainer = new CookieContainer();         
request.CookieContainer.Add(new Cookie("AspxAutoDetectCookieSupport", "1") { Domain = target.Host });

回复收藏 0 原文

~没有更多了~