Check whether a URL is text/html or another file type (e.g. an image)

Posted 2025-01-02 05:14:01 · 775 words · 1 view · 0 comments

I am writing my own web crawler as a C# 4.0 WPF application. I currently use HtmlAgilityPack to process HTML documents.

This is how I download pages at the moment:

    // Requires: using System.Net; using System.Text; using HtmlAgilityPack;

    // Configure one HtmlWeb instance with a random user agent,
    // a fixed encoding, and the pre-request hook.
    HtmlWeb hwWeb = new HtmlWeb
    {
        AutoDetectEncoding = false,
        OverrideEncoding = Encoding.GetEncoding("iso-8859-9"),
        UserAgent = lstAgents[GenerateRandomValue.GenerateRandomValueMin(irAgentsCount, 0)],
        PreRequest = OnPreRequest
    };
    HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);

    // Called before each request is sent; returning true lets the request proceed.
    private static bool OnPreRequest(HttpWebRequest request)
    {
        request.AllowAutoRedirect = true;
        return true;
    }

My question: how can I determine whether a given URL serves text/html (crawlable content) or some other type, such as an image or a PDF?

Thank you very much for your answers.

C# 4.0, WPF application

Comments (3)

吹梦到西洲 2025-01-09 05:14:01

Rather than relying on HtmlAgilityPack to download the page for you, you can download it yourself with HttpWebRequest. The resulting HttpWebResponse has a ContentType property that you can check, which lets you perform the check before attempting to parse the content.
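
To make the suggestion above concrete, here is a minimal sketch of such a check. The class and method names (ContentTypeProbe, IsHtml, UrlIsHtml) are made up for illustration, and treating application/xhtml+xml as crawlable is my own assumption, not something from the question. It sends a HEAD request so the body is not downloaded; some servers answer HEAD incorrectly or not at all, so a GET fallback may be needed in practice.

```csharp
using System;
using System.Net;

public static class ContentTypeProbe
{
    // Returns true when a Content-Type header value names an HTML media type.
    // Parameters such as "; charset=utf-8" are ignored.
    public static bool IsHtml(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return false;

        int semicolon = contentType.IndexOf(';');
        string mediaType = (semicolon >= 0
            ? contentType.Substring(0, semicolon)
            : contentType).Trim();

        // Assumption: XHTML is treated as crawlable alongside text/html.
        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request and inspects the Content-Type of the response.
    // Returns false when the request fails.
    public static bool UrlIsHtml(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.AllowAutoRedirect = true;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return IsHtml(response.ContentType);
            }
        }
        catch (WebException)
        {
            return false;
        }
    }
}
```

A crawler could then call ContentTypeProbe.UrlIsHtml(srPageUrl) first and only hand the URL to HtmlAgilityPack when it returns true, at the cost of one extra round trip per page.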

回忆凄美了谁 2025-01-09 05:14:01

You want to read the Content-Type in the response headers. From my experience with HtmlAgilityPack, I do not think it can do this on its own.

指尖上的星空 2025-01-09 05:14:01

I've never used HtmlAgilityPack, but I went ahead and looked at the documentation.

I see that you're setting the PreRequest field on the HtmlWeb object to a PreRequestHandler delegate. There's also a PostResponse field that takes a PostResponseHandler delegate. It looks like the HtmlWeb object passes that delegate the actual response it gets from the server, in the form of an HttpWebResponse object.

However, when your code in that delegate finishes, it looks like HtmlAgilityPack will carry on with whatever it would have done anyway. Does it throw an exception when it encounters non-HTML content? You may have to throw your own exception from your PostResponse function and catch it where you call Load().

As I said, I didn't try any of this. I hope it gets you started in the right direction.
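
For illustration, the PostResponse idea might look roughly like the sketch below. It is untested, and it assumes the PostResponseHandler delegate has the shape void (HttpWebRequest, HttpWebResponse); check this against the HtmlAgilityPack version you use. NonHtmlContentException, CrawlerLoader, LooksLikeHtml and LoadIfHtml are hypothetical names introduced only for this example.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical exception type used to abort loading of non-HTML responses.
public class NonHtmlContentException : Exception
{
    public NonHtmlContentException(string contentType)
        : base("Unexpected content type: " + contentType) { }
}

public static class CrawlerLoader
{
    // Hypothetical helper: true when the header value starts with "text/html".
    public static bool LooksLikeHtml(string contentType)
    {
        return contentType != null
            && contentType.Trim().StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }

    // PostResponse callback: inspects the server response and aborts
    // the load by throwing when the content is not HTML.
    private static void OnPostResponse(HttpWebRequest request, HttpWebResponse response)
    {
        if (!LooksLikeHtml(response.ContentType))
        {
            string contentType = response.ContentType;
            response.Close(); // release the connection before aborting
            throw new NonHtmlContentException(contentType);
        }
    }

    // Returns the parsed document, or null when the URL does not serve HTML.
    public static HtmlDocument LoadIfHtml(string url)
    {
        var web = new HtmlWeb();
        web.PostResponse = OnPostResponse;
        try
        {
            return web.Load(url);
        }
        catch (NonHtmlContentException)
        {
            return null;
        }
    }
}
```

Compared with a separate HEAD request, this keeps everything to a single request per URL, but the response for a non-HTML resource has already started arriving by the time the callback runs.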
