Screen scraping with Python
Does Python have screen scraping libraries that offer JavaScript support?
I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.
Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?
There are many options for dealing with static HTML, which the other responses cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the web page (including its JavaScript) and then examining the resulting HTML.
Beautiful Soup is still probably your best bet.
If you need "JavaScript support" for the purpose of intercepting Ajax requests, then you should also use some sort of capture tool (such as YATT) to monitor what those requests are, and then emulate and parse them.
If you need "JavaScript support" in order to see what the end result of a page with static JavaScript is, then my first choice would be to figure out what the JavaScript is doing on a case-by-case basis (e.g. if the JavaScript is doing something based on some XML, then just parse the XML directly instead).
If you really want "JavaScript support" (as in, you want to see what the HTML is after scripts have been run on a page), then I think you will probably need to create an instance of some browser control, read the resulting HTML / DOM back from the browser control once it has finished loading, and parse it normally with Beautiful Soup. That would be my last resort, however.
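The "emulate the request and parse the data directly" approach can be sketched with the standard library alone. This is only an illustration under assumptions: the `<items>`/`<item>` layout, the `name` and `price` fields, and the `X-Requested-With` header are hypothetical stand-ins for whatever your capture tool actually shows the page requesting.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical payload, standing in for what the page's JavaScript downloads.
SAMPLE_XML = (
    "<items>"
    "<item><name>Widget</name><price>9.50</price></item>"
    "<item><name>Gadget</name><price>12</price></item>"
    "</items>"
)


def fetch(url, timeout=10):
    """Download a data endpoint directly, the way the page's script would."""
    request = urllib.request.Request(
        url,
        # Assumption: some endpoints only answer requests that look like Ajax.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_xml_items(xml_text):
    """Return (name, price) pairs from the hypothetical <item> elements."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("name"), float(item.findtext("price")))
        for item in root.iter("item")
    ]


def parse_json_items(json_text):
    """Same idea for an endpoint that answers with JSON instead of XML."""
    data = json.loads(json_text)
    return [(entry["name"], float(entry["price"])) for entry in data["items"]]


if __name__ == "__main__":
    # No network access needed for the demonstration.
    print(parse_xml_items(SAMPLE_XML))
```

In real use you would call `fetch()` with the URL observed in the capture tool and pass its result to the matching parser, which avoids running any JavaScript at all.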
Here you go: http://scrapy.org/
Selenium, maybe? It lets you automate an actual browser (Firefox, IE, Safari) from Python (among other languages). It is meant for testing websites, but it should be usable for scraping as well. (Disclaimer: I have never used it myself.)
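A rough sketch of how that might look, assuming the third-party `selenium` package and a Firefox driver are installed. The helper names and the link-extraction step are illustrative choices, not something taken from this thread. Selenium is imported inside the function so the parsing half works without it.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url):
    """Load `url` in a real Firefox instance and return the current DOM as HTML.

    Assumption: selenium and a matching Firefox driver are available.
    """
    from selenium import webdriver  # third-party, imported lazily

    driver = webdriver.Firefox()
    try:
        driver.get(url)            # returns once the page has loaded
        return driver.page_source  # HTML as the browser currently sees it
    finally:
        driver.quit()              # always close the browser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Usage would be along the lines of `extract_links(fetch_rendered_html(url))`. Pages that keep modifying the DOM after the initial load may need an explicit wait before `page_source` is read.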
The Webscraping library wraps the PyQt4 WebView in a simple, easy-to-use API. It can, for example, download a web page rendered by WebKit and extract the title element using XPath.
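The XPath step on its own can be sketched with the standard library. This is not the Webscraping library's API. It assumes the rendered markup is well-formed XHTML, which `xml.etree.ElementTree` requires, and a Python version (3.8 or later) that accepts the `{*}` namespace wildcard. Real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendered page, well-formed so ElementTree can parse it.
SAMPLE_PAGE = (
    "<html><head><title> Example page </title></head>"
    "<body><p>Hello</p></body></html>"
)


def extract_title(xhtml_text):
    """Return the text of the first <title> element, or None if there is none."""
    root = ET.fromstring(xhtml_text)
    # ".//{*}title" is ElementTree's XPath subset: any <title> descendant,
    # whether or not the document declares the XHTML namespace.
    title = root.find(".//{*}title")
    if title is None or title.text is None:
        return None
    return title.text.strip()


if __name__ == "__main__":
    print(extract_title(SAMPLE_PAGE))
```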
You could try SpiderMonkey?
I have not found anything for this. I use a combination of Beautiful Soup and custom routines...