An optimized way to fetch HTML from URLs

Posted on 2024-11-30 02:01:08

Comments (2)

一城柳絮吹成雪 2024-12-07 02:01:08

WebClient probably has a simpler API, but both should work.

To run a large number of requests, use multiple threads or a thread pool. If the URLs are on the same server, be careful not to overload it.

If you want an example that uses a thread pool, I can provide one.

Update

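using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        // Number of requests that have finished, whether they succeeded or failed.
        private static int requests = 0;
        private static readonly object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                // Queue one work item per URL on the thread pool.
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;

            // Poll once a second until every request has been counted.
            while (cur_req < urls.Count)
            {
                Thread.Sleep(1000);

                lock (requests_lock)
                {
                    cur_req = requests;
                }
            }

            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            try
            {
                // Dispose the client, stream, and reader once the page is read.
                using (WebClient client = new WebClient())
                using (Stream data = client.OpenRead(url))
                using (StreamReader reader = new StreamReader(data))
                {
                    string html = reader.ReadToEnd();

                    // Do something with html
                    Console.WriteLine(html);
                }
            }
            catch (WebException e)
            {
                Console.WriteLine("Failed to fetch " + url + ": " + e.Message);
            }
            finally
            {
                // Count the request even on failure so the wait loop in Main can finish.
                lock (requests_lock)
                {
                    requests++;
                }
            }
        }
    }
}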
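
To illustrate the caution about not overloading a single server, here is a minimal sketch of one way to cap the number of requests in flight on the thread pool. It is not part of the original answer: the ThrottledFetch class, the Run method, and the maxConcurrent parameter are hypothetical names, and the actual download is passed in as a delegate so the throttling logic stays separate from WebClient.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class ThrottledFetch
{
    // Hypothetical helper: runs `fetch` for each URL on the thread pool while
    // allowing at most `maxConcurrent` calls to be in flight at once.
    public static void Run(IList<string> urls, Action<string> fetch, int maxConcurrent)
    {
        if (urls.Count == 0) return;

        using (var gate = new SemaphoreSlim(maxConcurrent))
        using (var done = new CountdownEvent(urls.Count))
        {
            foreach (var url in urls)
            {
                gate.Wait(); // blocks until one of the slots is free
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        fetch((string)state);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine("Failed: " + e.Message);
                    }
                    finally
                    {
                        // Release the slot before signalling, so the gate is
                        // never touched after Run has disposed it.
                        gate.Release();
                        done.Signal();
                    }
                }, url);
            }

            done.Wait(); // returns once every work item has signalled
        }
    }
}
```

A call such as ThrottledFetch.Run(urls, url => { /* download and process url */ }, 4) would then keep at most four requests running at a time. The CountdownEvent also replaces the sleep-and-poll loop, so the caller resumes as soon as the last request finishes.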

绾颜 2024-12-07 02:01:08

Use Parallel.Invoke to set up all the requests, and give it a generous MaxDegreeOfParallelism (set through ParallelOptions).

You'll spend most of your time waiting on I/O, so make as much use of multithreading as possible.
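
As a rough sketch of the suggestion above, the following builds one Action per URL and hands them to Parallel.Invoke together with a ParallelOptions instance. The ParallelFetch class, the FetchAll and Download methods, and the maxParallel parameter are hypothetical names chosen for illustration; they do not come from the answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

static class ParallelFetch
{
    // Hypothetical helper: runs `fetch` for every URL via Parallel.Invoke and
    // collects the results keyed by URL.
    public static ConcurrentDictionary<string, string> FetchAll(
        IEnumerable<string> urls, Func<string, string> fetch, int maxParallel)
    {
        var results = new ConcurrentDictionary<string, string>();

        // One Action per URL; each stores its own result in the dictionary.
        Action[] actions = urls
            .Select(url => (Action)(() => results[url] = fetch(url)))
            .ToArray();

        // MaxDegreeOfParallelism is an upper bound on concurrent actions.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        Parallel.Invoke(options, actions);
        return results;
    }

    // One possible fetcher: download the page body as a string with WebClient.
    public static string Download(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }
}
```

It could be called as ParallelFetch.FetchAll(urls, ParallelFetch.Download, 16). If any download throws, Parallel.Invoke surfaces the failures as an AggregateException after the actions have run, so a caller that wants partial results would need to catch exceptions inside the fetch delegate.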
