Screen scraping with Python

Posted 2024-08-20 06:03:55, 218 characters, 14 views, 0 comments

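Is there a Python screen-scraping library that supports JavaScript?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests that require JavaScript support.

Ideally I would like to do everything from Python, but I haven't come across any libraries that would allow me to do so. Do they exist?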

少女情怀诗 2024-08-27 06:03:55

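There are many options for handling static HTML, which the other answers cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the page (including its JavaScript) and then examining the resulting HTML. For example: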

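import sys
import signal

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Load a URL in WebKit, let its JavaScript run, and keep the resulting HTML."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # Restore the default SIGINT handler so Ctrl+C can stop the event loop.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.loadFinished.connect(self._finished_loading)
        self.mainFrame().load(QUrl(url))
        # Blocks here until _finished_loading() quits the application.
        self.app.exec_()

    def _finished_loading(self, result):
        # Serialize the DOM as it stands once the page has finished loading.
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print('Usage: %s url' % sys.argv[0])
    else:
        javascript_html = Render(url).html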
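
The "examine the resulting HTML" step could look roughly like the sketch below. It assumes the rendered page is already available as an ordinary Python 3 string; the `TitleAndLinks` class, the `examine` helper and the sample markup are hypothetical names and data made up for illustration, and only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every <a href> found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that sits inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


def examine(html):
    """Return (title, links) for an already-rendered HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Stand-in for the string a renderer would hand back (made-up content).
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
title, links = examine(sample_html)
print(title, links)
```

Any HTML parser would do for this step; the standard-library one is used here only to keep the sketch free of extra dependencies.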

凉风有信 2024-08-27 06:03:55

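Beautiful Soup is probably still your best bet.

If you need "JavaScript support" in order to intercept Ajax requests, then you should also use some sort of capture tool (such as YATT) to monitor what those requests are, and then emulate or parse them.

If you need "JavaScript support" in order to see the end result of a page with static JavaScript, then my first choice would be to work out what the JavaScript is doing on a case-by-case basis (for example, if the JavaScript is doing something based on some XML, just parse the XML directly).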
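
As a rough illustration of the "just parse the XML directly" route, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The payload, its element names and the `parse_products` helper are all hypothetical; in practice the XML would be whatever document the page's script requests, fetched with the HTTP client you already use.

```python
import xml.etree.ElementTree as ET

# Made-up payload standing in for the XML a page's JavaScript would fetch.
xml_payload = """<products>
  <product id="1"><name>Widget</name><price>9.50</price></product>
  <product id="2"><name>Gadget</name><price>12.00</price></product>
</products>"""


def parse_products(xml_text):
    """Turn each <product> element into a plain dict."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": node.get("id"),
            "name": node.findtext("name"),
            "price": float(node.findtext("price")),
        }
        for node in root.iter("product")
    ]


products = parse_products(xml_payload)
for product in products:
    print(product)
```

Going straight to the data source this way avoids running any JavaScript at all, which is why it is worth trying before a full browser.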

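If you really want "JavaScript support" (as in, you want to see what the HTML is after the scripts have run on a page), then I think you will probably need to create an instance of some browser control, read the resulting HTML/DOM back from that control once it has finished loading, and parse it as usual with Beautiful Soup. That would be my last resort, however.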

余生共白头 2024-08-27 06:03:55

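Scrapy is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/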

丿*梦醉红颜 2024-08-27 06:03:55

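Maybe Selenium? It lets you automate an actual browser (Firefox, IE, Safari) using Python (amongst other languages). It is meant for testing websites, but it seems it should be usable for scraping as well. (Disclaimer: I've never used it myself.)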
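
For what it's worth, driving a browser for scraping might look something like the sketch below. It relies only on the Selenium WebDriver calls `get()`, `page_source` and `quit()`; the `rendered_html` and `scrape_with_firefox` helpers and the `FakeDriver` stand-in are hypothetical names added for illustration, and the real-browser path assumes Selenium and Firefox are installed.

```python
def rendered_html(driver, url):
    """Load url in the given browser driver and return the page's HTML."""
    try:
        driver.get(url)            # returns once the browser reports the page loaded
        return driver.page_source  # serialized DOM at that moment
    finally:
        driver.quit()              # always close the browser session


def scrape_with_firefox(url):
    """Real-browser variant; needs the selenium package and Firefox."""
    # Imported lazily so the helper above can be tried without Selenium installed.
    from selenium import webdriver
    return rendered_html(webdriver.Firefox(), url)


class FakeDriver:
    """Stand-in driver used to show the call sequence without a browser."""

    def __init__(self):
        self.calls = []
        self.page_source = "<html><body>rendered</body></html>"

    def get(self, url):
        self.calls.append(("get", url))

    def quit(self):
        self.calls.append(("quit",))


fake = FakeDriver()
html = rendered_html(fake, "http://example.com/")
print(html)
```

Content that a page fetches after its initial load may not be present yet when `page_source` is read, so pages that rely heavily on Ajax may need an explicit wait before that call.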

我做我的改变 2024-08-27 06:03:55

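The Webscraping library wraps the PyQt4 WebView in a simple, easy-to-use API.

Here is a simple example that downloads a web page rendered by WebKit and extracts the title element using XPath (taken from the library's project page):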

from webscraping import download, xpath
D = download.Download()
# download and cache the Google Code webpage
html = D.get('http://code.google.com/p/webscraping')
# use xpath to extract the project title
print(xpath.get(html, '//div[@id="pname"]/a/span'))

緦唸λ蓇 2024-08-27 06:03:55

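Could you try spidermonkey?

This Python module allows for the implementation of Javascript classes, objects and functions in Python, as well as the evaluation and calling of Javascript scripts and functions. It borrows heavily from Claes Jacobssen's Javascript Perl module, which in turn is based on Mozilla's PerlConnect Perl binding.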