Scraping a problematic web site

Posted 2024-08-08 04:34:31

I'm trying to scrape some information from a web site, but am having trouble reading the relevant pages. The pages seem to first send a basic setup, then more detailed info. My download attempts only seem to capture the basic setup. I've tried urllib and mechanize so far.

Firefox and Chrome have no trouble displaying the pages, although I can't see the parts I want when I view page source.

A sample url is https://personal.vanguard.com/us/funds/snapshot?FundId=0542&FundIntExt=INT

I'd like, for example, average maturity and average duration from the lower right of the page. The problem isn't extracting that info from the page, it's downloading the page so that I can extract the info.

Comments (4)

橘味果▽酱 2024-08-15 04:34:31

The page uses JavaScript to load the data. Firefox and Chrome are only working because you have JavaScript enabled - try disabling it and you'll get a mostly empty page.

Python isn't going to be able to do this by itself - your best compromise would be to control a real browser (Internet Explorer is easiest, if you're on Windows) from Python using something like Pamie.

莫多说 2024-08-15 04:34:31

The website loads the data via ajax. Firebug shows the ajax calls. For the given page, the data is loaded from https://personal.vanguard.com/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542

See the corresponding javascript code on the original page:

<script>populator = new Populator({parentId:
"profileForm:vanguardFundTabBox:tab0",execOnLoad:true,
 populatorUrl:"/us/JSP/Funds/VGITab/VGIFundOverviewTabContent.jsf?FundIntExt=INT&FundId=0542",
inline:false,type:"once"});
</script>
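
Since the data comes from a separate URL named in the page's own script, one option is to skip the browser entirely: download the snapshot page, pull out every populatorUrl value, and request those URLs directly. The following is a minimal sketch of that idea in Python 3 using only the standard library. The function names (find_populator_urls, fetch, fetch_fragments), the regular expression, and the User-Agent header are illustrative assumptions, not anything the site documents. The site may also have changed since this was written, or may require cookies or a session, in which case this will not be enough.

```python
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Assumption: each AJAX fragment is announced in an inline script as
# populatorUrl:"<relative url>", as in the snippet quoted above.
POPULATOR_RE = re.compile(r'populatorUrl\s*:\s*"([^"]+)"')


def find_populator_urls(page_html, page_url):
    """Return the absolute URL of every populatorUrl found in page_html."""
    return [urljoin(page_url, rel) for rel in POPULATOR_RE.findall(page_html)]


def fetch(url, timeout=30):
    """Download url and return the body as text (decoding as UTF-8 is an assumption)."""
    # The User-Agent value is a guess; some sites reject the urllib default.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_fragments(page_url):
    """Fetch the main page, then every fragment it loads via populatorUrl."""
    page_html = fetch(page_url)
    return {url: fetch(url) for url in find_populator_urls(page_html, page_url)}
```

Calling fetch_fragments with the sample URL from the question would return a dictionary mapping each fragment URL to its HTML, which can then be parsed for average maturity and average duration in the usual way.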

╰◇生如夏花灿烂 2024-08-15 04:34:31

The reason is that the page performs AJAX calls after it loads. You will need to find the URLs those calls request and scrape their content as well.

萌无敌 2024-08-15 04:34:31

As RichieHindle mentioned, your best bet on Windows is to use the WebBrowser class to create an instance of an IE rendering engine and then use that to browse the site.

The class gives you full access to the DOM tree, so you can do whatever you want with it.

http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser(loband).aspx
