Preventing errors with HTMLAgilityPack in VB.Net

Posted on 2024-10-09 09:05:25

I'm using HTMLAgilityPack to parse HTML pages. However, at some point I try to parse the wrong kind of data (in this specific case an image), which of course fails for obvious reasons.

Private Sub parseHtml(ByVal content As String, ByVal url As String)
    Try
        Dim contentHash As String = hashGenerator.ComputeHash(content, "SHA1")
        Dim doc As HtmlDocument = New HtmlDocument()

        doc.Load(New StringReader(content))

        Dim root As HtmlNode = doc.DocumentNode
        Dim anchorTags As New List(Of String)

        For Each link As HtmlNode In root.SelectNodes("//a")
            cururl = link.OuterHtml
            If link.Attributes("href") Is Nothing Then Continue For
            If Uri.IsWellFormedUriString(link.Attributes("href").Value, UriKind.Absolute) Then
                urlQueue.Enqueue(link.Attributes("href").Value)
            Else
                Dim myUri As New Uri(url)
                urlQueue.Enqueue(myUri.Scheme & "://" & myUri.Host & link.Attributes("href").Value)
            End If
        Next
    Catch ex As Exception
        MsgBox(ex.Message, MsgBoxStyle.Critical, "Error (parseHtml(" & url & "))")
    End Try
End Sub

The error I get is:

A first chance exception of type 'System.NullReferenceException' occurred in Webcrawler.exe
Object reference not set to an instance of an object.

On the content I try to parse:

�����Iޥ�+�: 8�0�x�

How can I check whether the content is parseable before trying to parse it, so that I can prevent the error?

For now it is an image that triggers the error popup, but I think it could be anything that isn't (X)HTML.

Thanks in advance, oh great community :)

Comments (2)

眼波传意 2024-10-16 09:05:25

You need to check the returned content-type header before trying to parse the returned data.

For an HTML page this should be text/html; for XHTML it would be application/xhtml+xml.
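
As a rough illustration of the check described above, here is a minimal sketch. The module and function names (ContentTypeCheck, IsHtmlContentType) are made up for this example and are not part of HtmlAgilityPack or the asker's crawler; it assumes you already have the raw Content-Type header value as a string.

```vb
Imports System

Module ContentTypeCheck
    ' Hypothetical helper: True when the Content-Type header value names
    ' an HTML or XHTML media type. Parameters such as "; charset=utf-8"
    ' are ignored.
    Function IsHtmlContentType(ByVal contentType As String) As Boolean
        If String.IsNullOrEmpty(contentType) Then Return False

        ' Keep only the media type, e.g. "text/html" from "text/html; charset=utf-8".
        Dim mediaType As String = contentType.Split(";"c)(0).Trim()

        Return mediaType.Equals("text/html", StringComparison.OrdinalIgnoreCase) OrElse _
               mediaType.Equals("application/xhtml+xml", StringComparison.OrdinalIgnoreCase)
    End Function

    Sub Main()
        Console.WriteLine(IsHtmlContentType("text/html; charset=utf-8")) ' True
        Console.WriteLine(IsHtmlContentType("image/png"))                ' False
    End Sub
End Module
```

If the crawler downloads pages with HttpWebResponse, its ContentType property holds this header value, so the check can run before the body is ever handed to parseHtml.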

二智少女 2024-10-16 09:05:25

If you only have the content (i.e. you don't have access to the original HTTP headers, as Oded suggested), you could assume that a good HTML string should contain at least one "<" character within, say, the first 10 characters of the string.

Of course, this is no guarantee and you will still need to handle the extreme cases, but it should discard most garbage or unexpected content types while still letting encoding-specific leading bytes through (such as a UTF-8 byte order mark, etc.).
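
The heuristic above could be sketched roughly as follows. LooksLikeHtml and its maxOffset parameter are names invented for this example, and the default of 10 characters is simply the figure suggested above, not a tested threshold.

```vb
Imports System

Module HtmlSniffer
    ' Hypothetical helper, heuristic only: True when a "<" appears within
    ' the first maxOffset characters of the content.
    Function LooksLikeHtml(ByVal content As String, _
                           Optional ByVal maxOffset As Integer = 10) As Boolean
        If String.IsNullOrEmpty(content) Then Return False

        ' Never search past the end of a short string.
        Dim window As Integer = Math.Min(maxOffset, content.Length)

        ' IndexOf(value, startIndex, count) looks only at the first "window" characters.
        Return content.IndexOf("<"c, 0, window) >= 0
    End Function

    Sub Main()
        Console.WriteLine(LooksLikeHtml("  <html><body></body></html>")) ' True
        Console.WriteLine(LooksLikeHtml("GIF89a binary image data"))     ' False
    End Sub
End Module
```

In the question's parseHtml, a guard such as `If Not LooksLikeHtml(content) Then Return` before doc.Load would skip content like the image. Separately, HtmlAgilityPack's SelectNodes returns Nothing when no node matches, which is a likely source of the NullReferenceException in the question, so a Nothing check before the For Each is probably worth adding either way.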
