VB.NET WebBrowser.Document - what you see is not what you get
My attempts at writing a simple crawler seem to be confounded by the fact that my target webpage, as it appears in the UI browser control or in a typical browser application, is not completely accessible as an HtmlDocument (due to frames, JavaScript, etc.).
The code below executes, and the correct webpage (e.g. the one displaying items 50-59) can even be seen in the control. But where I would expect the retrieved "next page" hyperlink to be "...&start=60", I see something else: the link that corresponds to opening the first catalog page, "...&start=10".
What is odd is that if I press the button a second time, I DO get what I'm looking for. Even odder to me, if I insert a MsgBox, say right after the loop that waits for WebBrowserReadyState.Complete, then I also get what I'm looking for.
    Private Sub ButtonGo_Click(sender As System.Object, e As System.EventArgs) Handles ButtonGo.Click
        'start at this URL
        'e.g. http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50
        catalogPageURL = TextBoxInitialURL.Text
        WebBrowser1.Navigate(catalogPageURL)
        While WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
            Application.DoEvents()
        End While
        'Locate the URL associated with the NEXT>> hyperlink
        Dim allLinksInDocument As HtmlElementCollection = WebBrowser1.Document.GetElementsByTagName("a")
        Dim strNextPgLink As String = ""
        For Each link As HtmlElement In allLinksInDocument
            If link.GetAttribute("className") = "next" Then
                strNextPgLink = link.GetAttribute("href")
            End If
        Next
    End Sub
I've googled around enough to try things like handling the WebBrowser1.DocumentCompleted event, but that still didn't work. I've also tried inserting sleep commands.
I've avoided using WebClient and regular expressions, the way I would ordinarily have done this, because I'm convinced using the DOM will be easier for other things I have planned down the road. I'm aware of HTML Agility Pack but not ambitious enough to learn it, because it seems there has to be a simple way to keep this dang WebBrowser.Document object synchronized with the stuff you can actually see.
If this is because of JavaScript, is there a way I can tell the WebBrowser to just execute all the scripts?
First question on the forum, looking forward to more (smarter ones hopefully)
Comments (1)
Be warned when using WebBrowser1.Document or something similar: you will not get the 'raw HTML'.
Example (assume wbMain is a WebBrowser control):
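The snippet that belonged to this example is missing here. A minimal sketch of what it presumably showed, assuming RTB_RawHTML and RTB_BodyHTML are RichTextBox controls next to wbMain (the form class name and the URL, borrowed from the question's placeholder, are hypothetical):

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical form; only wbMain, RTB_RawHTML and RTB_BodyHTML are names from the answer.
Public Class RawVsBodyForm
    Inherits Form

    Private WithEvents wbMain As New WebBrowser()
    Private ReadOnly RTB_RawHTML As New RichTextBox()
    Private ReadOnly RTB_BodyHTML As New RichTextBox()

    Public Sub New()
        wbMain.Dock = DockStyle.Top
        RTB_RawHTML.Dock = DockStyle.Left
        RTB_BodyHTML.Dock = DockStyle.Fill
        ' The Fill control is added first so it takes whatever space is left over.
        Controls.Add(RTB_BodyHTML)
        Controls.Add(RTB_RawHTML)
        Controls.Add(wbMain)
    End Sub

    Protected Overrides Sub OnLoad(e As EventArgs)
        MyBase.OnLoad(e)
        ' Placeholder URL (assumption).
        wbMain.Navigate("http://www.somewebsite.com/properties?l=Dallas+TX&co=US&start=50")
    End Sub

    Private Sub wbMain_DocumentCompleted(sender As Object,
                                         e As WebBrowserDocumentCompletedEventArgs) Handles wbMain.DocumentCompleted
        ' The markup as the control received it.
        RTB_RawHTML.Text = wbMain.DocumentText

        ' The body serialized back out of the parsed DOM.
        If wbMain.Document IsNot Nothing AndAlso wbMain.Document.Body IsNot Nothing Then
            RTB_BodyHTML.Text = wbMain.Document.Body.OuterHtml
        End If
    End Sub
End Class
```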
In this example, the body markup shown in RTB_RawHTML will NOT perfectly match the HTML shown in RTB_BodyHTML. Accessing it through (yourwebbrowserhere).Document.Body.OuterHtml appears to 'clean' it somewhat, as opposed to the 'raw' HTML retrieved through (yourwebbrowserhere).DocumentText.
This was a problem for me when I was making a web scraper, as it would continually throw me off: sometimes I would try to match a tag and it would find it, and other times it wouldn't, even though I was sure it was there. The reason was that I was trying to match the raw HTML, but I needed to match the 'cleaned' HTML.
I'm not sure if this will help you isolate the problem or not; for me it did.
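For the code in the question, one possible explanation (not confirmed anywhere in this thread) is that WebBrowser1.ReadyState can still report Complete for the previously loaded page right after Navigate is called, so the wait loop exits at once and Document still refers to the old page. A minimal sketch that reads the link from the DocumentCompleted handler instead, assuming the control names from the question; the form class name is hypothetical, and showing the result in the title bar is only for illustration:

```vbnet
Imports System
Imports System.Windows.Forms

' Hypothetical form; control names follow the question.
Public Class CrawlerForm
    Inherits Form

    Private WithEvents WebBrowser1 As New WebBrowser()
    Private WithEvents ButtonGo As New Button()
    Private ReadOnly TextBoxInitialURL As New TextBox()

    Public Sub New()
        TextBoxInitialURL.Dock = DockStyle.Top
        ButtonGo.Text = "Go"
        ButtonGo.Dock = DockStyle.Top
        WebBrowser1.Dock = DockStyle.Fill
        Controls.Add(WebBrowser1)
        Controls.Add(ButtonGo)
        Controls.Add(TextBoxInitialURL)
    End Sub

    Private Sub ButtonGo_Click(sender As Object, e As EventArgs) Handles ButtonGo.Click
        ' Only start the navigation here; the DOM is read once loading has finished.
        WebBrowser1.Navigate(TextBoxInitialURL.Text)
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(sender As Object,
                                              e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        ' The event also fires for each frame; ignore everything except the top-level page.
        If e.Url <> WebBrowser1.Url Then Return

        ' Locate the URL associated with the NEXT>> hyperlink.
        Dim strNextPgLink As String = ""
        For Each link As HtmlElement In WebBrowser1.Document.GetElementsByTagName("a")
            If link.GetAttribute("className") = "next" Then
                strNextPgLink = link.GetAttribute("href")
                Exit For
            End If
        Next

        ' Show the result in the title bar (illustration only).
        Text = strNextPgLink
    End Sub
End Class
```

If the "next" link is injected by a script that runs after the page has finished loading, it may still be absent when this handler runs, so treat this as a starting point rather than a guaranteed fix.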