Scraping dynamic content from a website
I need to scrape news announcements from this website, Link.
The announcements seem to be generated dynamically; they don't appear in the page source. I usually use mechanize, but I assume it wouldn't work here. What can I do? I'm fine with Python or Perl.
If the content is generated dynamically, you can use Windmill or Selenium to drive a browser and grab the data once it's been rendered. You can find an example here.
The polite option would be to ask the owners of the site if they have an API that allows you access to their news stories.

The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call that pulls in the data. Looks like it's this one. But it appears to contain session data, so I don't know how long it will keep working.
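Once you've identified the AJAX URL in your browser's network panel, you can often call it directly and parse the response. A minimal Python sketch, assuming a hypothetical endpoint that returns JSON (the URL and field names below are made up; substitute what you actually observe):

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint -- replace with the AJAX URL seen in the network tab.
NEWS_URL = "http://example.com/ajax/news"

def parse_news(payload):
    """Extract (title, date) pairs from a JSON payload assumed to look like
    {"items": [{"title": "...", "date": "..."}, ...]}."""
    data = json.loads(payload)
    return [(item["title"], item["date"]) for item in data["items"]]

def fetch_news(url=NEWS_URL):
    """Call the endpoint directly and parse its JSON body."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return parse_news(resp.read().decode("utf-8"))
```

If the call turns out to require session cookies, you would first need to fetch the main page (with a cookie-aware opener) so the session gets established.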
There's also WWW::Scripter, "for scripting web sites that have scripts". Never used it.
In Python you can use urllib and urllib2 to connect to a website and collect data. For example:
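A minimal sketch, assuming Python 3, where the old urllib/urllib2 split has been merged into urllib.request (the URL below is a placeholder; some sites refuse requests that lack a browser-like User-Agent header):

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Fetch a URL and return the response body decoded as text."""
    with urlopen(build_request(url)) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

Note that this only gets you the raw HTML; since the announcements are injected by JavaScript after load, a plain fetch won't contain them, which is why the browser-driving or AJAX-tracing approaches above are needed.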