HttpWebRequest grinding to a halt, possibly just due to page size


I have a WPF app that processes a lot of URLs (thousands); each is sent off to its own thread, does some processing and stores a result in the database.

The URLs can be anything, but some seem to be massively big pages, and those seem to shoot the memory usage up a lot and make performance really bad. I set a timeout on the web request, so if it takes longer than, say, 20 seconds it doesn't bother with that URL, but that doesn't seem to make much difference.

Here's the code section:

HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(urlAddress.Address);
req.Timeout = 20000;
req.ReadWriteTimeout = 20000;
req.Method = "GET";
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

using (StreamReader reader = new StreamReader(req.GetResponse().GetResponseStream()))
{
    pageSource = reader.ReadToEnd();
    req = null;
}

It also seems to stall and ramp up memory on reader.ReadToEnd().

I would have thought having a cut-off of 20 seconds would help. Is there a better method? I assume there's not much advantage to using the async web methods, as each URL download is on its own thread anyway.

Thanks

3 Answers

撩起发的微风 2025-01-03 15:18:31


In general, it's recommended that you use asynchronous HttpWebRequests instead of creating your own threads. The article I've linked above also includes some benchmarking results.
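
A minimal sketch of what the asynchronous pattern can look like (this is not the linked article's code; the AsyncFetcher/FetchAsync names and the use of .NET 4's Task.Factory.FromAsync are illustrative assumptions):

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static class AsyncFetcher
{
    // Issue the request asynchronously so a thread isn't blocked while waiting on the network.
    // Note: HttpWebRequest.Timeout is not honoured by the Begin/End calls, so enforce your
    // own time budget if you need one.
    public static Task<string> FetchAsync(Uri uri)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
        req.Method = "GET";
        req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

        return Task.Factory
            .FromAsync<WebResponse>(req.BeginGetResponse, req.EndGetResponse, null)
            .ContinueWith(t =>
            {
                using (WebResponse response = t.Result)
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            });
    }
}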

I don't know what you're doing with the page source after you read the stream to end, but using string can be an issue:

System.String type is used in any .NET application. We have strings
as: names, addresses, descriptions, error messages, warnings or even
application settings. Each application has to create, compare or
format string data. Considering the immutability and the fact that any
object can be converted to a string, all the available memory can be
swallowed by a huge amount of unwanted string duplicates or unclaimed
string objects.

Some other suggestions:

  • Do you have any firewall restrictions? I've seen a lot of issues at work where the firewall enables rate limiting and fetching pages grinds down to a halt (happens to me all the time)!
  • I presume that you're going to use the string to parse HTML, so I would recommend that you initialize your parser with the Stream instead of passing in a string containing the page source (if that's an option).
  • If you're storing the page source in the database, then there isn't much you can do.
  • Try to eliminate the reading of the page source as a potential contributor to the memory/performance problem by commenting it out.
  • Use a streaming HTML parser such as Majestic 12, which avoids the need to load the entire page source into memory (again, if you need to parse)!
  • Limit the size of the pages you're going to download, say, only download 150KB; the average page size is about 100KB-130KB. (A sketch of this follows the list.)
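
A rough sketch of the size cap from the last bullet (the CappedFetcher/FetchCapped names and the 8 KB chunk size are illustrative assumptions; it stops reading once roughly maxBytes have been consumed):

using System;
using System.IO;
using System.Net;
using System.Text;

static class CappedFetcher
{
    public static string FetchCapped(Uri uri, int maxBytes)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
        req.Timeout = 20000;
        req.ReadWriteTimeout = 20000;
        req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

        using (WebResponse response = req.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[8192];
            int read;
            // Stop pulling from the network once the byte budget is spent (may overshoot
            // by at most one chunk).
            while (buffer.Length < maxBytes &&
                   (read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                buffer.Write(chunk, 0, read);
            }
            // Assumes UTF-8; a fuller version would honour the response's declared charset.
            return Encoding.UTF8.GetString(buffer.ToArray());
        }
    }
}

Usage would be something like: string page = CappedFetcher.FetchCapped(uri, 150 * 1024);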

Additionally, can you tell us what your initial rate of fetching pages is and what it drops to? Are you seeing any errors/exceptions from the web request as you're fetching pages?

Update

In the comment section I noticed that you're creating thousands of threads, and I would say that you don't need to do that. Start with a small number of threads and keep increasing them until you see performance peak on your system. Once you start adding threads and the performance looks like it's tapered off, stop adding threads. I can't imagine that you will need more than 128 threads (even that seems high). Create a fixed number of threads, e.g. 64, let each thread take a URL from your queue, fetch the page, process it, and then go back to getting pages from the queue again.
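
A minimal sketch of that fixed-pool arrangement, using a BlockingCollection as the shared queue (FetchPage and ProcessPage are placeholders for the poster's own download and processing code):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class Crawler
{
    public static void Run(IEnumerable<Uri> urls, int workerCount)
    {
        var queue = new BlockingCollection<Uri>();
        foreach (Uri url in urls) queue.Add(url);
        queue.CompleteAdding();   // nothing more will be enqueued

        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Each worker loops: take a URL, fetch it, process it, repeat until the queue is drained.
                foreach (Uri url in queue.GetConsumingEnumerable())
                {
                    string page = FetchPage(url);
                    ProcessPage(page);
                }
            });
            workers[i].Start();
        }
        foreach (Thread t in workers) t.Join();
    }

    static string FetchPage(Uri url) { /* download, e.g. with HttpWebRequest */ return ""; }
    static void ProcessPage(string page) { /* parse and write to the database */ }
}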

久夏青 2025-01-03 15:18:31


You could enumerate with a buffer instead of calling ReadToEnd, and if it is taking too long, then you could log and abandon - something like:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

static void Main(string[] args)
{
  Uri largeUri = new Uri("http://www.rfkbau.de/index.php?option=com_easybook&Itemid=22&startpage=7096");
  DateTime start = DateTime.Now;
  int timeoutSeconds = 10;
  foreach (var s in ReadLargePage(largeUri))
  {
    // Bail out between chunks once the overall time budget is exceeded.
    if ((DateTime.Now - start).TotalSeconds > timeoutSeconds)
    {
      Console.WriteLine("Stopping - this is taking too long.");
      break;
    }
  }
}

static IEnumerable<string> ReadLargePage(Uri uri)
{
  int bufferSize = 8192;
  int readCount;
  char[] readBuffer = new char[bufferSize];
  HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
  using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
  using (StreamReader stream = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
  {
    readCount = stream.Read(readBuffer, 0, bufferSize);
    while (readCount > 0)
    {
      // Yield only the characters actually read; the final chunk is usually shorter than the buffer.
      yield return new string(readBuffer, 0, readCount);
      readCount = stream.Read(readBuffer, 0, bufferSize);
    }
  }
}

暖风昔人 2025-01-03 15:18:31


Lirik has a really good summary.

I would add that if I were implementing this, I would make a separate process that reads the pages. So it would be a pipeline: the first stage downloads the URL and writes it to a disk location, then queues that file to the next stage. The next stage reads from disk and does the parsing and DB updates. That way you get maximum throughput on both the download and the parsing. You can also tune your thread pools so that you have more workers parsing, etc. This architecture also lends itself very well to distributed processing, where you can have one machine downloading and another host parsing, etc.
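
One possible shape for that pipeline, sketched with two in-memory queues handing work between the stages (the class, the temp-file naming, and the completion bookkeeping are illustrative assumptions, not a full implementation):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class PipelineSketch
{
    // Stage 1 downloads pages and persists them to disk; stage 2 reads the files back,
    // parses them, and updates the database. Download() and ParseAndStore() are placeholders.
    static readonly BlockingCollection<Uri> ToDownload = new BlockingCollection<Uri>();
    static readonly BlockingCollection<string> ToParse = new BlockingCollection<string>();

    static void Start(int downloaders, int parsers)
    {
        for (int i = 0; i < downloaders; i++)
            Task.Factory.StartNew(() =>
            {
                foreach (Uri url in ToDownload.GetConsumingEnumerable())
                {
                    string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid() + ".html");
                    File.WriteAllText(path, Download(url));   // stage 1: fetch and persist
                    ToParse.Add(path);                        // hand the file off to stage 2
                }
                // Call ToParse.CompleteAdding() once every downloader has finished (bookkeeping omitted).
            }, TaskCreationOptions.LongRunning);

        for (int i = 0; i < parsers; i++)
            Task.Factory.StartNew(() =>
            {
                foreach (string path in ToParse.GetConsumingEnumerable())
                    ParseAndStore(File.ReadAllText(path));    // stage 2: parse and update the DB
            }, TaskCreationOptions.LongRunning);
    }

    static string Download(Uri url) { /* HttpWebRequest as in the question */ return ""; }
    static void ParseAndStore(string html) { /* parsing + DB write */ }
}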

Another thing to note is that if you are hitting the same server from multiple threads (even if you are using async), you will run up against the maximum outgoing connection limit. You can throttle yourself to stay below that, or increase the connection limit via the ServicePointManager class.
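
For the connection limit, the relevant setting is ServicePointManager.DefaultConnectionLimit; for example, somewhere in your startup code (16 is just an illustrative value):

using System.Net;

// Client apps default to 2 concurrent connections per host; raising the limit lets more
// downloads to the same server run in parallel. Tune the number for your environment.
ServicePointManager.DefaultConnectionLimit = 16;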
