VB.NET WebBrowser.Document - What you see is not what you get

Posted 2025-01-01 16:45:59

My attempts at writing a simple crawler seem to be confounded by the fact that my target webpage (as it appears in the UI browser control, or in a typical browser application) is not completely accessible as an HtmlDocument (due to frames, JavaScript, etc.).

The code below executes, and the correct webpage (e.g. the one displaying items 50-59) can even be seen in the control. But where I would expect the "next page" hyperlink retrieved to be "...&start=60", I see something else: the one corresponding to opening the first catalog page, "...&start=10".
What is odd is that if I press the button a second time, I DO get what I'm looking for. Even odder to me, if I insert a MsgBox, say right after the loop that waits for WebBrowserReadyState.Complete, then I also get what I'm looking for.

Private Sub ButtonGo_Click(sender As System.Object, e As System.EventArgs) Handles ButtonGo.Click
    'start at this URL
    'e.g. http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50
    catalogPageURL = TextBoxInitialURL.Text
    WebBrowser1.Navigate(catalogPageURL)
    While WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
        Application.DoEvents()
    End While
    'Locate the URL associated with the NEXT>> hyperlink
    Dim allLinksInDocument As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
    Dim strNextPgLink As String = ""
    For Each link As HtmlElement In allLinksInDocument
        If link.GetAttribute("className") = "next" Then
            strNextPgLink = link.GetAttribute("href")
        End If
    Next
End Sub

I’ve googled around enough to try things like using the WebBrowser1.DocumentCompleted event, but that still didn’t work. I’ve also tried inserting sleep commands.

I’ve avoided using WebClient and regular expressions, the way I would ordinarily have done this, because I’m convinced using the DOM will be easier for other things I have planned down the road. I’m aware of HTML Agility Pack but not ambitious enough to learn it, because it seems there has to be a simple way to keep this dang WebBrowser.Document object synchronized with the stuff you can actually see.

If this is because of JavaScript, is there a way I can tell the WebBrowser to just execute all the scripts?

First question on the forum, looking forward to more (smarter ones hopefully)

氛圍 2025-01-08 16:45:59

Be warned when using WebBrowser1.Document or something similar: you will not get the 'raw HTML'.

Example (assume wbMain is a WebBrowser control):

    RTB_RawHTML.Text = wbMain.DocumentText
    Try
        RTB_BodyHTML.Text = wbMain.Document.Body.OuterHtml
    Catch
        debugMessage("Body tag not found.")
    End Try

In this example, the markup inside the body tag as shown in RTB_RawHTML will NOT perfectly match the HTML shown in RTB_BodyHTML. Accessing it through (yourwebbrowserhere).Document.Body.OuterHtml appears to 'clean' it somewhat, as opposed to the 'raw' HTML retrieved by (yourwebbrowserhere).DocumentText.

This was a problem for me when I was making a web scraper, as it would continually throw me off: sometimes I would try to match a tag and it would find it, and other times it wouldn't, even though I was sure it was there. The reason was that I was trying to match the raw HTML, but I needed to match the 'cleaned' HTML.
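
One way to avoid depending on either text form is to query the parsed DOM instead of string-matching HTML. The following is only a minimal sketch: `FindLinkByClass` and the module name `DomLookup` are hypothetical names introduced here for illustration, and the `"className"` lookup is borrowed from the question's code rather than from anything I have tested against the target site.

```vbnet
Imports System.Windows.Forms

Module DomLookup
    ' Hypothetical helper (not part of the WebBrowser API):
    ' returns the href of the first <a> element whose class equals cssClass,
    ' or an empty string when no such link (or no document) is available.
    Public Function FindLinkByClass(browser As WebBrowser, cssClass As String) As String
        ' Document is Nothing until a page has been loaded into the control.
        If browser.Document Is Nothing Then Return ""

        For Each link As HtmlElement In browser.Document.GetElementsByTagName("a")
            ' "className" is the DOM property name for the HTML class attribute.
            If link.GetAttribute("className") = cssClass Then
                Return link.GetAttribute("href")
            End If
        Next

        Return ""
    End Function
End Module
```

A call such as `FindLinkByClass(wbMain, "next")` would then look the link up in the same parsed structure that `Document.Body.OuterHtml` is produced from, so differences between raw and 'cleaned' markup should matter less.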

I'm not sure if this will help you isolate the problem or not. For me it did.
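
Separately from the raw-versus-cleaned point, the timing symptom in the question may come from reading the document before the new page has actually replaced the old one: `Navigate` returns immediately, so a `ReadyState` loop placed right after it can see the previous page's state. A pattern that is often suggested is to do the extraction in `DocumentCompleted` and to ignore the events raised for individual frames. The sketch below assumes the control names from the question; `CrawlerForm` and `Program` are hypothetical names, and this will not help if the page rewrites the link by script after loading has finished.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Windows.Forms

' Hypothetical form; control names are reused from the question.
Public Class CrawlerForm
    Inherits Form

    Private WithEvents WebBrowser1 As New WebBrowser()
    Private WithEvents ButtonGo As New Button()
    Private TextBoxInitialURL As New TextBox()

    Public Sub New()
        WebBrowser1.Dock = DockStyle.Fill
        ButtonGo.Text = "Go"
        ButtonGo.Dock = DockStyle.Top
        TextBoxInitialURL.Dock = DockStyle.Top
        Controls.Add(WebBrowser1)
        Controls.Add(ButtonGo)
        Controls.Add(TextBoxInitialURL)
    End Sub

    Private Sub ButtonGo_Click(sender As Object, e As EventArgs) Handles ButtonGo.Click
        ' Navigate is asynchronous: start it and return, with no DoEvents loop.
        WebBrowser1.Navigate(TextBoxInitialURL.Text)
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object,
            e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' The event is raised for each frame as well; act only on the top-level page.
        If Not e.Url.Equals(WebBrowser1.Url) Then Return
        If WebBrowser1.ReadyState <> WebBrowserReadyState.Complete Then Return

        ' Locate the URL associated with the NEXT>> hyperlink, as in the question.
        Dim strNextPgLink As String = ""
        For Each link As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
            If link.GetAttribute("className") = "next" Then
                strNextPgLink = link.GetAttribute("href")
                Exit For
            End If
        Next
        Debug.WriteLine("Next page link: " & strNextPgLink)
    End Sub
End Class

Module Program
    <STAThread>
    Sub Main()
        Application.Run(New CrawlerForm())
    End Sub
End Module
```

Keeping the click handler free of a wait loop means the UI thread returns to the message loop, which is what lets the control finish loading; that would also be consistent with the MsgBox "fixing" the original code, though I have not verified that this is the cause.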
