HttpWebRequest grinding to a halt, possibly just due to page size
I have a WPF app that processes a lot of URLs (thousands); each one is sent off to its own thread, does some processing, and stores a result in the database.
The URLs can be anything, but some seem to be massively big pages, which shoots memory usage up a lot and makes performance really bad. I set a timeout on the web request, so if it takes longer than, say, 20 seconds it doesn't bother with that URL, but it seems to make little difference.
Here's the code section:
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(urlAddress.Address);
req.Timeout = 20000;
req.ReadWriteTimeout = 20000;
req.Method = "GET";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    pageSource = reader.ReadToEnd();
    req = null;
}
It also seems to stall/ramp up memory on reader.ReadToEnd().
I would have thought having a cut-off of 20 seconds would help; is there a better method? I assume there's not much advantage to using the async web methods, as each URL download is on its own thread anyway.
Thanks
Comments (3)
In general, it's recommended that you use asynchronous HttpWebRequests instead of creating your own threads. The article I've linked above also includes some benchmarking results.
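For what it's worth, a minimal sketch of what an asynchronous request might look like with BeginGetResponse/EndGetResponse (urlAddress and the processing step are carried over from the question; note that req.Timeout is not honored on asynchronous calls, so a hard cut-off would need req.Abort()):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlAddress.Address);
req.Method = "GET";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
req.BeginGetResponse(ar =>
{
    try
    {
        // Complete the request and make sure the response gets disposed.
        using (WebResponse resp = req.EndGetResponse(ar))
        using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
        {
            string pageSource = reader.ReadToEnd();
            // hand pageSource off to the processing/DB step here
        }
    }
    catch (WebException)
    {
        // log the failure and move on to the next URL
    }
}, null);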
I don't know what you're doing with the page source after you read the stream to the end, but using a string can be an issue: a very large page becomes an equally large string in memory, and strings that size end up on the large object heap.
Some other suggestions:
Use a Stream to initialize your parser instead of passing in a string containing the page source (if that's an option).
Additionally, can you tell us what your initial rate of fetching pages is, and what it goes down to? Are you seeing any errors/exceptions from the web request as you're fetching pages?
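For instance, if the parser can take a Stream directly, the page never has to be materialized as one big string. A sketch assuming HtmlAgilityPack (the question doesn't say which parser is actually in use):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlAddress.Address);
req.Timeout = 20000;
req.ReadWriteTimeout = 20000;
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
using (WebResponse resp = req.GetResponse())
using (Stream responseStream = resp.GetResponseStream())
{
    // HtmlAgilityPack's HtmlDocument can load straight from a Stream,
    // so the whole page is never held as a single string.
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(responseStream);
    // ... extract whatever the processing step needs from doc ...
}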
Update
In the comment section I noticed that you're creating thousands of threads, and I would say that you don't need to do that. Start with a small number of threads and keep increasing the count until you find the performance peak on your system. Once adding threads no longer helps and performance looks like it's tapered off, stop adding threads. I can't imagine that you will need more than 128 threads (even that seems high). Create a fixed number of threads, e.g. 64, and let each thread take a URL from your queue, fetch the page, process it, and then go back to getting pages from the queue again.
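A rough sketch of that fixed worker pool, assuming the URLs sit in a BlockingCollection<string> and FetchAndProcess is a hypothetical method that downloads, parses and stores one page (requires System.Collections.Concurrent and System.Threading):
BlockingCollection<string> urlQueue = new BlockingCollection<string>();
// ... add all URLs with urlQueue.Add(...), then call urlQueue.CompleteAdding() ...

const int WorkerCount = 64; // illustrative; tune until throughput stops improving
Thread[] workers = new Thread[WorkerCount];
for (int i = 0; i < WorkerCount; i++)
{
    workers[i] = new Thread(() =>
    {
        // Each worker keeps pulling URLs until the queue is drained and completed.
        foreach (string url in urlQueue.GetConsumingEnumerable())
        {
            try
            {
                FetchAndProcess(url); // hypothetical: download, parse, store in DB
            }
            catch (Exception)
            {
                // log and move on to the next URL
            }
        }
    });
    workers[i].IsBackground = true;
    workers[i].Start();
}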
You could enumerate with a buffer instead of calling ReadToEnd, and if it is taking too long, then you could log and abandon - something like:
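A minimal sketch of that idea, assuming a 20-second cutoff and a 4 KB read buffer (both values are illustrative), with req built as in the question:
string pageSource = null;
var watch = System.Diagnostics.Stopwatch.StartNew();
using (WebResponse resp = req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
    var sb = new StringBuilder();
    char[] buffer = new char[4096];
    bool abandoned = false;
    int read;
    // Read the page in chunks so the elapsed time can be checked as we go.
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        sb.Append(buffer, 0, read);
        if (watch.Elapsed > TimeSpan.FromSeconds(20))
        {
            // Taking too long: log this URL and abandon the page.
            abandoned = true;
            break;
        }
    }
    if (!abandoned)
        pageSource = sb.ToString();
}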
Lirik has a really good summary.
I would add that if I were implementing this, I would make a separate process that reads the pages. So it would be a pipeline: the first stage downloads the URL and writes it to a disk location, then queues that file to the next stage; the next stage reads from disk and does the parsing and DB updates. That way you get maximum throughput on the download and the parsing as well. You can also tune your thread pools so that you have more workers parsing, etc. This architecture also lends itself very well to distributed processing, where you can have one machine downloading and another host parsing, etc.
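A rough sketch of those two stages, using BlockingCollection queues to hand work between them (the temp-file naming and the ParseAndStore method are assumptions for illustration; requires System.Collections.Concurrent, System.IO and System.Net):
static BlockingCollection<string> urls = new BlockingCollection<string>();
static BlockingCollection<string> downloadedFiles = new BlockingCollection<string>();

// Stage 1: download the page to disk and queue the file path for stage 2.
// (Calling downloadedFiles.CompleteAdding() once all downloads finish is omitted from this sketch.)
static void DownloadWorker()
{
    using (var client = new WebClient())
    {
        foreach (string url in urls.GetConsumingEnumerable())
        {
            string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".html");
            client.DownloadFile(url, path);
            downloadedFiles.Add(path);
        }
    }
}

// Stage 2: read the file back, parse it and update the database, then clean up.
// Run more of these workers than download workers if parsing is the bottleneck.
static void ParseWorker()
{
    foreach (string path in downloadedFiles.GetConsumingEnumerable())
    {
        string page = File.ReadAllText(path);
        ParseAndStore(page); // hypothetical: parse + DB update
        File.Delete(path);
    }
}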
Another thing to note is that if you are hitting the same server from multiple threads (even if you are using async requests), you will run up against the maximum outgoing connection limit. You can throttle yourself to stay below that, or increase the connection limit via the ServicePointManager class.
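For example (the values here are arbitrary illustrations, not recommendations; the default per-host limit for a client app is 2):
// Raise the default per-host connection limit for the whole process.
ServicePointManager.DefaultConnectionLimit = 64;

// Or raise it only for one specific host.
ServicePoint sp = ServicePointManager.FindServicePoint(new Uri("http://example.com/"));
sp.ConnectionLimit = 16;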