HttpWebResponse + StreamReader very slow
I'm trying to implement a limited web crawler in C# (for a few hundred sites only)
using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I also tried using StreamReader.Read() in a loop to build my HTML string.
I'm only downloading pages that are about 5-10 KB.
It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!
All the sites should be very fast, as they are very close to my location and have fast servers (downloading in Internet Explorer takes practically no time), and I am not using any proxy.
My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?
How do I reduce StreamReader.ReadToEnd() times DRASTICALLY?
9 Answers
HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:
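The config snippet this answer refers to was lost in extraction; the standard app.config fragment for disabling automatic proxy detection (confirmed by the `defaultProxy enabled="false"` mention in a later answer) looks like this:

```xml
<configuration>
  <system.net>
    <!-- Skip automatic proxy detection, which can add seconds per request -->
    <defaultProxy enabled="false" />
  </system.net>
</configuration>
```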
You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:
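The buffering suggestion can be sketched as follows. A MemoryStream stands in for the network stream so the snippet is self-contained; in the crawler, the stream would come from `response.GetResponseStream()`:

```csharp
using System;
using System.IO;
using System.Text;

class BufferedReadDemo
{
    static void Main()
    {
        // Stand-in for response.GetResponseStream(); in the real crawler
        // this would be the network stream from HttpWebResponse.
        byte[] page = Encoding.UTF8.GetBytes("<html>example page</html>");
        using (Stream raw = new MemoryStream(page))
        // BufferedStream batches many small reads into fewer, larger
        // calls against the underlying stream (or OS socket).
        using (var buffered = new BufferedStream(raw, 8192))
        using (var reader = new StreamReader(buffered, Encoding.UTF8))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine(html.Length);
        }
    }
}
```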
WebClient's DownloadString is a simple wrapper for HttpWebRequest; could you try using that temporarily and see if the speed improves? If things get much faster, could you share your code so we can have a look at what may be wrong with it?
EDIT:
It seems HttpWebRequest observes IE's 'max concurrent connections' setting. Are these URLs on the same domain? You could try increasing the connection limit to see if that helps. I found this article about the problem:
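The article itself isn't reproduced here, but the usual config-file way to raise the limit looks like the fragment below (the value is an assumption; with 20 threads against one host you need at least 20 connections):

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- Default is 2 per host for desktop apps; raise it to match
           the number of concurrent crawler threads -->
      <add address="*" maxconnection="20" />
    </connectionManagement>
  </system.net>
</configuration>
```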
I had the same problem, but it was solved when I set the HttpWebRequest's Proxy property to null.
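In code, that change is one line per request. A minimal sketch (the URL is a placeholder):

```csharp
using System;
using System.Net;

class ProxyDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        // Skip automatic proxy detection entirely for this request.
        request.Proxy = null;
        Console.WriteLine(request.Proxy == null);
    }
}
```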
Have you tried ServicePointManager.maxConnections? I usually set it to 200 for things similar to this.
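The actual property name is ServicePointManager.DefaultConnectionLimit (the config-file attribute is `maxconnection`). Set it once at startup, before any requests are made; 200 mirrors the value in the answer:

```csharp
using System;
using System.Net;

class Startup
{
    static void Main()
    {
        // Raise the per-host connection cap from the default (2 for
        // desktop apps) so 20 crawler threads aren't serialized.
        ServicePointManager.DefaultConnectionLimit = 200;
        Console.WriteLine(ServicePointManager.DefaultConnectionLimit);
    }
}
```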
I had the same problem, but worse.
response = (HttpWebResponse)webRequest.GetResponse(); in my code
delayed about 10 seconds before any further code ran, and after that the download saturated my connection.
Kurt's answer, defaultProxy enabled="false",
solved the problem. Now the response is almost instant and I can download any HTTP file at my connection's maximum speed. :)
I found the application-config method did not work, but the problem was still due to the proxy settings. My simple request used to take up to 30 seconds; now it takes about 1 second.
Thank you all for the answers; they helped me dig in the proper direction. I faced the same performance issue, but the proposed solution of changing the application config file (as I understand it, that solution is for web applications) doesn't fit my needs. My solution is shown below:
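The code for this answer did not survive extraction. Based on the surrounding answers, the programmatic equivalent of the config change is most likely something like the following (a reconstruction, not the author's exact code):

```csharp
using System;
using System.Net;

class Program
{
    static void Main()
    {
        // Disable proxy auto-detection process-wide, in code rather
        // than in the application config file.
        WebRequest.DefaultWebProxy = null;
        Console.WriteLine(WebRequest.DefaultWebProxy == null);
    }
}
```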
Why wouldn't multithreading solve this issue? Multithreading would minimize the network wait times, and since you'd be storing the contents of the buffer in system memory (RAM), there would be no IO bottleneck from dealing with a filesystem. Thus, your 82 pages that take 82 seconds to download and parse should take about 15 seconds (assuming a 4-core processor). Correct me if I'm missing something.
____ DOWNLOAD THREAD_____*
Download Contents
Form Stream
Read Contents
_________________________*
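The thread layout sketched above can be expressed with PLINQ. `Download` here is a placeholder that fakes a fetch so the snippet is self-contained; in the crawler it would use HttpWebRequest to fetch and return the page HTML:

```csharp
using System;
using System.Linq;

class ParallelCrawlSketch
{
    // Stand-in for the real download step; replace with an
    // HttpWebRequest/StreamReader fetch in the actual crawler.
    static string Download(string url) => $"<html>{url}</html>";

    static void Main()
    {
        var urls = Enumerable.Range(1, 8)
                             .Select(i => $"http://example.com/page{i}")
                             .ToList();

        // Each URL is downloaded and read on its own worker; results
        // stay in memory, so there is no filesystem IO bottleneck.
        string[] pages = urls.AsParallel()
                             .WithDegreeOfParallelism(4)
                             .Select(Download)
                             .ToArray();

        Console.WriteLine(pages.Length);
    }
}
```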
Try adding the cookie AspxAutoDetectCookieSupport=1 to your request, like this:
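The code for this answer was also lost in extraction; one plausible way to attach that cookie with HttpWebRequest is shown below (a sketch; the URL and cookie domain are placeholders):

```csharp
using System;
using System.Net;

class CookieDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        request.CookieContainer = new CookieContainer();
        // Some ASP.NET sites probe for cookie support with an extra
        // redirect round-trip; sending this cookie up front skips it.
        request.CookieContainer.Add(
            new Cookie("AspxAutoDetectCookieSupport", "1", "/", "example.com"));
        Console.WriteLine(request.CookieContainer.Count);
    }
}
```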