Web scraping on sites that require JavaScript support
Possible Duplicate:
Screen Scraping from a web page with a lot of Javascript
I just want to do tasks such as form entry and web scraping, but on sites that require JavaScript support. I also need to enter forms, scrape, and so on within the same session. Ideally, I'd like a way to control a web browser from the command line. I also want to do all of this on Linux only, so I can't use .NET.
I found the webbrowser library for Python, but its capabilities look very limited. If it could interface with mechanize and BeautifulSoup, that'd be amazing. Any suggestions? Thanks!
You could certainly write a XUL application with Mozilla (run it with Firefox, Xulrunner, etc.) which scripts a web browser. JavaScript is normally used for such tasks.
What I've found tricky is suppressing all the kinds of dialogue boxes the browser would otherwise create - you effectively have to override the behaviour of the XPCOM server classes invoked for each type of dialogue, and there are a lot of different ones (for example, if your site decides to redirect to an https site with an expired certificate).
Of course, you should NOT use such a mechanism to violate any site's policy on use by robots. Normally you should never submit a form with a robot.
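As an alternative to a XUL application, one common way to drive a real browser from Python on Linux is Selenium with headless Firefox. This is a minimal sketch, assuming Firefox and geckodriver are installed; the URL and form-field names are placeholders, not from any real site.

```python
# Sketch: form entry plus scraping in one browser session via Selenium.
# Because a real browser renders the page, JavaScript-dependent sites work,
# and a single WebDriver instance keeps cookies between requests.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run without a visible window

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.com/login")  # hypothetical URL

    # Form entry: the field names below are placeholders.
    driver.find_element(By.NAME, "username").send_keys("alice")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "form button[type=submit]").click()

    # Scraping in the same session: page_source is the DOM after
    # JavaScript has run, so it can be handed to BeautifulSoup.
    html = driver.page_source
finally:
    driver.quit()
```

The key design point is that one `driver` object plays the role mechanize's browser would: it carries the session, so form submission and subsequent scraping see the same logged-in state.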
This has been asked already.