Check whether a URL is text/html or another file type (e.g. an image)
I am writing my own web crawler as a C# 4.0 WPF application. Currently I am using HtmlAgilityPack to process HTML documents.
This is how I download pages at the moment:
    // Configure a single HtmlWeb instance. Reassigning hwWeb after setting
    // UserAgent and PreRequest would discard both settings.
    HtmlWeb hwWeb = new HtmlWeb
    {
        AutoDetectEncoding = false,
        OverrideEncoding = Encoding.GetEncoding("iso-8859-9"),
        UserAgent = lstAgents[GenerateRandomValue.GenerateRandomValueMin(irAgentsCount, 0)],
        PreRequest = OnPreRequest
    };
    HtmlDocument hdMyDoc = hwWeb.Load(srPageUrl);

    // Called before each request is sent; returning true lets it proceed.
    private static bool OnPreRequest(HttpWebRequest request)
    {
        request.AllowAutoRedirect = true;
        return true;
    }
My question: I want to be able to determine whether a given URL is text/html (crawlable content) or some other type, such as an image or a PDF. How can I do that?
Thank you very much for the answers.
C# 4.0, WPF application
Comments (3)
Rather than relying on HtmlAgilityPack to download the page for you, you can download it yourself with HttpWebRequest. The HttpWebResponse you get back has a ContentType property that you can check. This lets you perform the check before attempting to parse the content.

You want to read the Content-Type in the response header. From my experience with HtmlAgilityPack, I do not think this can be done with it.
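The Content-Type check described above could look roughly like the sketch below. It is only an illustration under some assumptions: the names ContentTypeChecker, IsHtmlContentType and UrlIsHtml are made up for this example, treating application/xhtml+xml as crawlable is my own choice, and using a HEAD request assumes the server answers HEAD the same way it answers GET (some servers do not, in which case a GET would be needed).

```csharp
using System;
using System.Net;

public static class ContentTypeChecker
{
    // Pure helper: true when a Content-Type header value denotes HTML.
    // Header values often carry parameters, e.g. "text/html; charset=iso-8859-9",
    // so only the part before the first ';' is compared.
    public static bool IsHtmlContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return false;

        string mediaType = contentType.Split(';')[0].Trim();
        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request so that only the headers are transferred.
    // Assumes an http/https URL; GetResponse() throws WebException on
    // network failures and on 4xx/5xx responses, which callers should handle.
    public static bool UrlIsHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.AllowAutoRedirect = true;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return IsHtmlContentType(response.ContentType);
        }
    }
}
```

With something like this in place, the crawler would call ContentTypeChecker.UrlIsHtml(srPageUrl) first and only hand the URL to HtmlWeb.Load when it returns true. The cost is one extra round trip per URL.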
I've never used HtmlAgilityPack, but I went ahead and looked at the documentation.
I see that you're setting the PreRequest field on the HtmlWeb object to a PreRequestHandler delegate. There's also a PostResponse field that takes a PostResponseHandler delegate. It looks like the HtmlWeb object passes that delegate the actual response it gets from the server, in the form of an HttpWebResponse object.
However, when your code in that delegate finishes, it looks like the Agility Pack will carry on with whatever it would have done anyway. Does it throw an exception when it encounters non-HTML content? You may have to throw your own exception from your PostResponse function and catch it where you call Load().
As I said, I haven't tried any of this. I hope it gets you started in the right direction.
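For what it's worth, the PostResponse idea might be sketched like this. Treat it as untested: NonHtmlContentException, PostResponseCrawler and LoadIfHtml are hypothetical names introduced for the example, and the sketch assumes that an exception thrown inside the PostResponse callback propagates out of Load() rather than being swallowed by the library. It needs a reference to the HtmlAgilityPack assembly.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical exception type used to signal "not HTML" out of the callback.
public class NonHtmlContentException : Exception
{
    public string ContentType { get; private set; }

    public NonHtmlContentException(string contentType)
        : base("Not crawlable content: " + contentType)
    {
        ContentType = contentType;
    }
}

public static class PostResponseCrawler
{
    // Has the shape of HtmlWeb.PostResponseHandler:
    // (HttpWebRequest, HttpWebResponse) with no return value.
    private static void OnPostResponse(HttpWebRequest request, HttpWebResponse response)
    {
        string contentType = response.ContentType ?? string.Empty;

        // Prefix match, so values with parameters such as
        // "text/html; charset=iso-8859-9" are accepted.
        if (!contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase))
            throw new NonHtmlContentException(contentType);
    }

    // Returns the parsed document, or null when the response was not text/html.
    // Assumption: the exception thrown in OnPostResponse reaches this catch block.
    public static HtmlDocument LoadIfHtml(string url)
    {
        HtmlWeb web = new HtmlWeb();
        web.PostResponse = OnPostResponse;

        try
        {
            return web.Load(url);
        }
        catch (NonHtmlContentException)
        {
            return null;
        }
    }
}
```

Compared with issuing a separate request first, this keeps everything in one round trip, but it relies on library behaviour that should be verified against the HtmlAgilityPack version in use.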