Preventing errors with HtmlAgilityPack in VB.NET
I'm using HtmlAgilityPack to parse HTML pages. However, at some point I try to parse the wrong kind of data (in this specific case an image), which of course fails for obvious reasons.
    Private Sub parseHtml(ByVal content As String, ByVal url As String)
        Try
            Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
            Dim doc As HtmlDocument = New HtmlDocument()
            doc.Load(New StringReader(content))
            Dim root As HtmlNode = doc.DocumentNode
            Dim anchorTags As New List(Of String)
            For Each link As HtmlNode In root.SelectNodes("//a")
                cururl = link.OuterHtml
                If link.Attributes("href") Is Nothing Then Continue For
                If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
                    urlQueue.Enqueue(link.Attributes("href").Value)
                Else
                    Dim myUri As New Uri(url)
                    urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
                End If
            Next
        Catch ex As Exception
            MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
        End Try
    End Sub
The error I get is:
A first chance exception of type 'System.NullReferenceException' occurred in Webcrawler.exe
Object reference not set to an instance of an object.
On the content I try to parse:
�����Iޥ�+�: 8�0�x�
How can I check whether the content is parseable before trying to parse it, so that I can prevent the error?
For now it is an image that triggers the error popup, but I think it could be anything that isn't (X)HTML.
Thanks in advance, oh great community :)
Comments (2)
You need to check the returned Content-Type header before trying to parse the returned data. For an HTML page this should be text/html; for XHTML it would be application/xhtml+xml.
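As a rough illustration of that check, here is a minimal sketch. The module and function names (ContentTypeCheck, IsHtmlContentType) are made up for this example and are not part of HtmlAgilityPack or the question's code. It assumes the header value is already available as a string and ignores any parameters such as the charset.

```vbnet
Imports System

Module ContentTypeCheck
    ' Hypothetical helper: returns True when the Content-Type header value
    ' names an HTML or XHTML media type. Parameters such as
    ' "; charset=utf-8" are ignored.
    Function IsHtmlContentType(ByVal contentType As String) As Boolean
        If String.IsNullOrEmpty(contentType) Then Return False
        ' Keep only the media type, dropping anything after the first ";".
        Dim mediaType As String = contentType.Split(";"c)(0).Trim()
        If mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase) Then Return True
        Return mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase)
    End Function

    Sub Main()
        Console.WriteLine(IsHtmlContentType("text/html; charset=utf-8")) ' True
        Console.WriteLine(IsHtmlContentType("image/png"))                ' False
    End Sub
End Module
```

If the crawler downloads pages with HttpWebRequest, the value to pass in would presumably be the ContentType property of the HttpWebResponse, checked before the body is handed to parseHtml.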
If you only have the content (that is, you can't access the original HTTP headers as Oded suggested), you could assume that a valid HTML string contains at least one "<" character within, say, the first 10 characters of the string.
Of course, this is no guarantee and you will still need to handle edge cases, but it should discard most garbage or unexpected content types while still letting leading encoding bytes through (such as a UTF-8 byte order mark).
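The heuristic described here could be sketched roughly as follows. The names HtmlSniffing, LooksLikeHtml and maxOffset are hypothetical, and the threshold of 10 characters is simply the figure suggested above, not a tested value.

```vbnet
Imports System

Module HtmlSniffing
    ' Hypothetical helper. Heuristic only: returns True when a "<" appears
    ' within the first maxOffset characters of the content.
    Function LooksLikeHtml(ByVal content As String, Optional ByVal maxOffset As Integer = 10) As Boolean
        If String.IsNullOrEmpty(content) Then Return False
        Dim index As Integer = content.IndexOf("<"c)
        Return index >= 0 AndAlso index < maxOffset
    End Function

    Sub Main()
        Console.WriteLine(LooksLikeHtml("<html><body></body></html>")) ' True
        Console.WriteLine(LooksLikeHtml("  <!DOCTYPE html>"))          ' True
        Console.WriteLine(LooksLikeHtml("GIF89a binary image data"))   ' False
    End Sub
End Module
```

A caller could return early from parseHtml when LooksLikeHtml(content) is False. Binary data that happens to contain a "<" near the start would still slip through, so the existing Try/Catch remains useful as a fallback.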