VB.NET WebBrowser.Document - what you see is not what you get

Posted 2025-01-01 16:45:59 · 1523 words · 1 view · 0 comments

My attempts at writing a simple crawler seem to be confounded by the fact that my target webpage, as it appears in the WebBrowser control or in an ordinary browser, is not completely accessible as an HtmlDocument (because of frames, JavaScript, etc.).

The code below runs, and the correct page (e.g. the one displaying items 50-59) is even visible in the control. But where I would expect the retrieved "next page" hyperlink to be "...&start=60", I get something else: the link that belongs to the first catalog page, "...&start=10".
What is odd is that if I press the button a second time, I DO get what I'm looking for. Even odder to me, if I insert a MsgBox right after the loop that waits for WebBrowserReadyState.Complete, I also get what I'm looking for.

Private Sub ButtonGo_Click(sender As System.Object, e As System.EventArgs) Handles ButtonGo.Click
    'start at this URL
    'e.g. http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50
    catalogPageURL = TextBoxInitialURL.Text
    WebBrowser1.Navigate(catalogPageURL)
    While WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
        Application.DoEvents()
    End While
    'Locate the URL associated with the NEXT>> hyperlink
    Dim allLinksInDocument As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
    Dim strNextPgLink As String = ""
    For Each link As HtmlElement In allLinksInDocument
        If link.GetAttribute("className") = "next" Then
            strNextPgLink = link.GetAttribute("href")
        End If
    Next
End Sub

I've googled around enough to try things like handling the WebBrowser1.DocumentCompleted event, but that still didn't work. I've also tried inserting sleep commands.
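
For readers unfamiliar with the event-based approach mentioned above, here is a minimal sketch of it. It rests on assumptions: Navigate only starts the load, so a ReadyState check made straight after it may still describe the previous document (one possible explanation for the second click working, not a confirmed diagnosis), and DocumentCompleted is also raised for frames, so the handler compares e.Url with the control's Url as a common heuristic. The class name CatalogForm and the title-bar output are hypothetical; the control and variable names mirror the code above. If the link is rewritten by script after the page has loaded, this alone may not be enough.

```vbnet
Imports System
Imports System.Windows.Forms

' Sketch only: CatalogForm is a hypothetical name; controls mirror the question.
Public Class CatalogForm
    Inherits Form

    Private WithEvents WebBrowser1 As New WebBrowser()
    Private WithEvents ButtonGo As New Button()
    Private TextBoxInitialURL As New TextBox()
    Private strNextPgLink As String = ""

    Public Sub New()
        WebBrowser1.Dock = DockStyle.Fill
        TextBoxInitialURL.Dock = DockStyle.Top
        ButtonGo.Dock = DockStyle.Top
        ButtonGo.Text = "Go"
        Controls.AddRange(New Control() {WebBrowser1, ButtonGo, TextBoxInitialURL})
    End Sub

    Private Sub ButtonGo_Click(sender As Object, e As EventArgs) Handles ButtonGo.Click
        ' Navigate only starts the load; the document is read in DocumentCompleted.
        WebBrowser1.Navigate(TextBoxInitialURL.Text)
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object,
            e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' The event is also raised for frames; skip anything but the top-level page.
        If Not e.Url.Equals(WebBrowser1.Url) Then Return

        ' Same search as in the question: keep the href of links whose class is "next".
        strNextPgLink = ""
        For Each link As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
            If link.GetAttribute("className") = "next" Then
                strNextPgLink = link.GetAttribute("href")
            End If
        Next

        ' Show the result in the form's title bar (illustrative output only).
        Text = strNextPgLink
    End Sub
End Class
```

The point of this layout is that nothing reads WebBrowser1.Document inside the click handler, so no DoEvents loop is needed.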

I've avoided WebClient and regular expressions, which is how I would ordinarily have done this, because I'm convinced the DOM will be easier for other things I have planned down the road. I'm aware of HTML Agility Pack but not ambitious enough to learn it, because it seems there has to be a simple way to keep this dang WebBrowser.Document object synchronized with what you can actually see.

If this is because of JavaScript, is there a way I can tell the WebBrowser to just execute all of the scripts?

First question on the forum, looking forward to more (smarter ones, hopefully).

Comments (1)

氛圍 2025-01-08 16:45:59

Be warned when using WebBrowser1.Document or something similar: you will not get the 'raw' HTML.

Example (assume wbMain is a WebBrowser control):

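    ' Raw HTML text as exposed by the control.
    RTB_RawHTML.Text = wbMain.DocumentText
    Try
        ' HTML as exposed through the Document object; may differ from the raw text.
        RTB_BodyHTML.Text = wbMain.Document.Body.OuterHtml
    Catch
        ' Reached when Document or Body is Nothing, e.g. before a page has loaded.
        debugMessage("Body tag not found.")
    End Try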

In this example, the markup inside the body tag as shown in RTB_RawHTML will NOT perfectly match the HTML shown in RTB_BodyHTML. Accessing it through (yourwebbrowserhere).Document.Body.OuterHtml appears to 'clean' it somewhat, as opposed to the 'raw' HTML retrieved by (yourwebbrowserhere).DocumentText.

This was a problem for me when I was making a web scraper, as it would continually throw me off. Sometimes I would try to match a tag and it would find it, and other times it wouldn't, even though I was sure it was there. The reason was that I was trying to match the raw HTML, but I needed to match the 'cleaned' HTML.
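
A minimal sketch of how to check which of the two texts a search fragment actually appears in. The module name HtmlSourceProbe, the function DescribeMatch and its parameter names are hypothetical and only for illustration; DocumentText, Document.Body and OuterHtml are the only WebBrowser members used.

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical helper module, not part of the WebBrowser API.
Public Module HtmlSourceProbe

    ' Reports whether a fragment occurs in the raw text, the 'cleaned' text, or both.
    Public Function DescribeMatch(browser As WebBrowser, fragment As String) As String
        ' Raw HTML as exposed by DocumentText.
        Dim raw As String = If(browser.DocumentText, String.Empty)

        ' 'Cleaned' HTML as exposed through the Document object, if one is loaded.
        Dim cleaned As String = String.Empty
        If browser.Document IsNot Nothing AndAlso browser.Document.Body IsNot Nothing Then
            cleaned = If(browser.Document.Body.OuterHtml, String.Empty)
        End If

        ' Case-insensitive search, since the two texts may differ in letter case.
        Dim inRaw As Boolean = raw.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0
        Dim inCleaned As Boolean = cleaned.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0

        Return String.Format("raw: {0}, cleaned: {1}", inRaw, inCleaned)
    End Function
End Module
```

Calling something like debugMessage(HtmlSourceProbe.DescribeMatch(wbMain, "<a class")) after the page has loaded shows at a glance whether a pattern written against one text would also match the other.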

I'm not sure if this will help you isolate the problem or not. For me it did.
