How to determine whether a URI is a file with htmlagilitypack
For example, this URL is not shown as a file:
http://www.darty.com.tr/e_commerce/ximg/yeniyil/darty%20garanty%20brosur.pdf
But it is a PDF file. What I want is simply to identify all URLs that cannot be crawled, such as pdf, doc, or docx files. How can I do that with C# 4.0 and htmlagilitypack?
Thank you.
Not recognized as a file: http://img695.imageshack.us/img695/61/notshowasfile.png
Uri is part of the base .NET Framework; it has nothing to do with the HTML Agility Pack. It also has nothing to do with the target being a PDF. The documentation for Uri.IsFile says that the property is true when the URI's Scheme equals Uri.UriSchemeFile. In other words, IsFile answers the question, "Is this a file:// URI?" Since this is an http:// URI, the answer is no.

You seem to be confusing URLs with content. A Uri is just a fancy string; its job is to be a URI, not to go out to the server and ask questions about the content at that URL. "Is this a file type I know how to crawl?" cannot be answered by looking at the URL; http://example.com/articles/123 could be a Web page, a PDF, a text file, a JPEG, or any of a thousand other things. You have to send a GET or HEAD request to the server and look at the returned Content-Type to know what type of content that URL represents.
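
As a rough illustration of the HEAD-request approach, here is a minimal sketch that should compile under C# 4.0. The class and method names (CrawlCheck, IsCrawlableContentType, GetContentType) are hypothetical, and the choice to treat only text/html and application/xhtml+xml as crawlable is an assumption; adjust it to whatever your crawler can actually parse.

```csharp
using System;
using System.Net;

static class CrawlCheck
{
    // Hypothetical helper: treats only HTML-like media types as crawlable (an assumption).
    public static bool IsCrawlableContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return false;

        // Drop parameters such as "; charset=utf-8" before comparing.
        string mediaType = contentType.Split(';')[0].Trim();

        return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase)
            || mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request and returns the Content-Type header the server reports.
    public static string GetContentType(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.ContentType;
        }
    }

    static void Main()
    {
        var pdf = new Uri("http://www.darty.com.tr/e_commerce/ximg/yeniyil/darty%20garanty%20brosur.pdf");

        // IsFile only reports whether the scheme is file://, not what the URL points to.
        Console.WriteLine(pdf.IsFile);                                   // False
        Console.WriteLine(new Uri("file:///C:/temp/report.pdf").IsFile); // True

        try
        {
            string contentType = GetContentType(pdf);
            Console.WriteLine("{0} -> crawlable: {1}", contentType, IsCrawlableContentType(contentType));
        }
        catch (WebException ex)
        {
            // GetResponse throws on network failures and on 4xx/5xx responses.
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```

Some servers reject HEAD or answer it with a different Content-Type than they would for GET, so a crawler may want to fall back to a GET request and check the header before reading the body.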