HtmlAgilityPack WebGet.Load gives error "Object reference not set to an instance of an object"
I am working on a project that fetches new car prices from dealers' websites. I can fetch the HTML of most sites, but for one of them the WebGet.Load(url) method fails with the error "Object reference not set to an instance of an object." I couldn't find any difference between these websites.
Examples of URLs that work:
http://www.renault.com.tr/page.aspx?id=1715
http://www.hyundai.com.tr/tr/Content.aspx?id=fiyatlistesi
The problematic website:
http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx
Thank you for your help.
var webGet = new HtmlWeb();
var document = webGet.Load("http://www.fiat.com.tr/Pages/tr/otomobiller/grandepunto_fiyat.aspx");
When I use this URL, the document is not loaded.
The actual problem is inside HtmlAgilityPack. The page that fails declares this meta content type:
<META http-equiv="Content-Type" content="text/html; charset=8859-9">
where charset=8859-9 appears to be incorrect. Internally, HtmlAgilityPack tries to resolve this string to an encoding with something like Encoding.GetEncoding("8859-9"), and that throws an error (I think the actual encoding name should be iso-8859-9).
All you really need is to tell HtmlAgilityPack not to read the declared encoding for the HtmlDocument (just set HtmlDocument.OptionReadEncoding = false), but this seems to be impossible with HtmlWeb.Load (setting HtmlWeb.AutoDetectEncoding doesn't help here). So the workaround is to read the URL manually (the simplest way). This works and successfully parses the page.
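A minimal sketch of such a manual read is shown here. It is an assumption about how the workaround could look, not necessarily the code from the original post. The names ManualHtmlLoader, LoadFromStream and LoadFromUrl are made up for illustration, and the caller has to supply an encoding that matches the page.

```csharp
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class ManualHtmlLoader
{
    // Parses HTML from a stream using the encoding supplied by the caller.
    public static HtmlDocument LoadFromStream(Stream stream, Encoding encoding)
    {
        var document = new HtmlDocument();

        // Skip the charset declared in the page's <META> tag, so a malformed
        // value such as "8859-9" is never passed to Encoding.GetEncoding.
        document.OptionReadEncoding = false;

        document.Load(stream, encoding);
        return document;
    }

    // Downloads the page manually instead of going through HtmlWeb.Load.
    public static HtmlDocument LoadFromUrl(string url, Encoding encoding)
    {
        // WebRequest matches the era of the question; newer runtimes mark it
        // obsolete in favour of HttpClient, but it still compiles and runs.
        var request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            return LoadFromStream(stream, encoding);
        }
    }
}
```

A call could look like ManualHtmlLoader.LoadFromUrl(url, Encoding.GetEncoding("iso-8859-9")). On .NET Framework that encoding name resolves out of the box; on .NET Core and later it probably requires registering CodePagesEncodingProvider.Instance via Encoding.RegisterProvider first.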
EDIT: @Simon Mourier: yes, it raises a NullReferenceException, because it catches the ArgumentException and sets _declaredencoding = null there. The subsequent _declaredencoding.WindowsCodePage line then throws the null reference. The relevant code is in the ReadDocumentEncoding method of HtmlDocument.cs.
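The failure mode described in the edit can be reproduced in isolation using only the base class library. The sketch below is not the HtmlAgilityPack source and not the stack trace from the post; DeclaredEncodingDemo and TryGetEncoding are hypothetical names that only mimic the described pattern of swallowing the ArgumentException and then dereferencing the resulting null.

```csharp
using System;
using System.Text;

public static class DeclaredEncodingDemo
{
    // Hypothetical stand-in for the lookup described above:
    // an unknown charset name is swallowed and turned into null.
    public static Encoding TryGetEncoding(string charset)
    {
        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // "8859-9" is not a registered encoding name, so we end up here.
            return null;
        }
    }

    public static void Main()
    {
        Encoding declared = TryGetEncoding("8859-9");
        Console.WriteLine("Lookup returned null: " + (declared == null));

        try
        {
            // Dereferencing the null result is what surfaces as
            // "Object reference not set to an instance of an object".
            int codePage = declared.WindowsCodePage;
            Console.WriteLine(codePage);
        }
        catch (NullReferenceException ex)
        {
            Console.WriteLine(ex.GetType().Name + ": " + ex.Message);
        }
    }
}
```

The point of the sketch is that the original ArgumentException, which names the bad charset, is lost, and the caller only sees the much less informative null reference error.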