Scraping a problematic website
I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.
Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.
A sample URL is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT
I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page; it's downloading the page so that I can extract the info.
The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.
Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.
The website loads the data via Ajax. Firebug shows the Ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542
See the corresponding JavaScript code on the original page.
This happens because the page performs AJAX calls after it loads. You will need to find the URLs those calls request and scrape their content as well.
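As a rough illustration of that approach, here is a minimal Python 3 sketch that derives the Ajax URL from the snapshot page URL and downloads it with the standard library. The endpoint path is the one reported via Firebug elsewhere in this thread; the helper names `ajax_url_for` and `fetch`, the User-Agent header, and the timeout are assumptions made for illustration. The site may have changed its URLs since, so treat this as a pattern rather than a working scraper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request, urlopen

# Endpoint path as reported in this thread (observed with Firebug).
# Assumption: it may no longer exist on the live site.
AJAX_PATH = "/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf"


def ajax_url_for(snapshot_url):
    """Build the Ajax data URL that corresponds to a fund snapshot page URL."""
    parts = urlsplit(snapshot_url)
    query = parse_qs(parts.query)
    # Reuse the fund identifiers from the page URL's query string.
    params = [
        ("FundIntExt", query["FundIntExt"][0]),
        ("FundId", query["FundId"][0]),
    ]
    return f"{parts.scheme}://{parts.netloc}{AJAX_PATH}?{urlencode(params)}"


def fetch(url, timeout=30):
    """Download a URL and return its body as text (hypothetical helper)."""
    # Assumption: a browser-like User-Agent; the site may require more
    # (cookies, a Referer header) than this sketch sends.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    page = "https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT"
    print(ajax_url_for(page))
```

Calling `fetch(ajax_url_for(page))` would then return the HTML fragment that the browser normally injects into the page, which can be parsed for fields such as average maturity and average duration.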
As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.
The class gives you full access to the DOM tree, so you can do whatever you want with it.
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx