Screen scraping with Python

Posted on 2024-08-20 06:03:55

Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

Comments (7)

少女情怀诗 2024-08-27 06:03:55

There are many options when dealing with static HTML, which the other responses cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the webpage (including the JavaScript) and then examining the resulting HTML. For example:

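import sys
import signal
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Load a URL in WebKit, run its JavaScript, and keep the resulting HTML."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # Restore the default SIGINT handler so Ctrl+C can stop the event loop.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        # Block here until _finished_loading() quits the event loop.
        self.app.exec_()

    def _finished_loading(self, result):
        # Called once the page has loaded; capture the rendered HTML.
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        print('Usage: %s url' % sys.argv[0])
    else:
        javascript_html = Render(url).html

凉风有信 2024-08-27 06:03:55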

Beautiful Soup is probably still your best bet.

If you need "JavaScript support" for the purpose of intercepting Ajax requests, then you should use some sort of capture tool (such as YATT) to monitor what those requests are, and then emulate them and parse the responses.
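As a rough sketch of the "emulate" step: once a capture tool has shown which URL the page's JavaScript calls, the same request can be rebuilt with the standard library. The endpoint `http://example.com/api/items`, the `page` parameter, and the JSON payload shape below are hypothetical placeholders, not taken from any real site, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_ajax_request(endpoint, params):
    """Rebuild the GET request that the page's JavaScript would send."""
    url = endpoint + "?" + urlencode(params)
    # Some servers check this header to tell Ajax calls from page loads.
    return Request(url, headers={
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
    })


def parse_items(payload):
    """Pull item names out of a JSON body (hypothetical payload shape)."""
    return [item["name"] for item in json.loads(payload)["items"]]


def fetch_items(endpoint, params):
    """Send the emulated request and parse the response (assumes UTF-8)."""
    with urlopen(build_ajax_request(endpoint, params)) as response:
        return parse_items(response.read().decode("utf-8"))


# Hypothetical endpoint and captured response body, for illustration only.
request = build_ajax_request("http://example.com/api/items", {"page": 2})
sample_body = '{"items": [{"name": "first"}, {"name": "second"}]}'
print(request.full_url)
print(parse_items(sample_body))
```

Keeping the request building and the parsing in separate functions makes it possible to check the parser against a captured response body without touching the network.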

If you need "JavaScript support" in order to see what the end result of a page with static JavaScript is, then my first choice would be to try to figure out what the JavaScript is doing on a case-by-case basis (e.g. if the JavaScript is doing something based on some XML, then just parse the XML directly instead).
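For instance, if the script only turns an XML feed into markup, the feed can be read directly with `xml.etree.ElementTree`. This is a minimal sketch; the `<products>` feed and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML feed that the page's JavaScript would otherwise render.
SAMPLE_FEED = """\
<products>
  <product id="1"><name>Widget</name><price>9.99</price></product>
  <product id="2"><name>Gadget</name><price>24.50</price></product>
</products>
"""


def parse_products(xml_text):
    """Return (id, name, price) tuples from the feed."""
    root = ET.fromstring(xml_text)
    return [
        (int(p.get("id")), p.findtext("name"), float(p.findtext("price")))
        for p in root.iter("product")
    ]


for product in parse_products(SAMPLE_FEED):
    print(product)
```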

If you really want "JavaScript support" (as in, you want to see what the HTML is after scripts have been run on a page), then I think you will probably need to create an instance of some browser control, read the resulting HTML/DOM back from the browser control once it's finished loading, and parse it normally with Beautiful Soup. That would be my last resort, however.
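The parsing half of that last resort might look roughly like the sketch below. It assumes the rendered HTML has already been read back from the browser control as a string, and it swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra packages; `LinkCollector` and `extract_links` are illustrative names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a rendered page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(rendered_html):
    """Parse the post-JavaScript HTML and return its link targets."""
    collector = LinkCollector()
    collector.feed(rendered_html)
    collector.close()
    return collector.links


# Stand-in for the HTML string read back from the browser control.
rendered = '<ul id="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
print(extract_links(rendered))
```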

余生共白头 2024-08-27 06:03:55

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Here you go: http://scrapy.org/

丿*梦醉红颜 2024-08-27 06:03:55

Selenium, maybe? It allows you to automate an actual browser (Firefox, IE, Safari) using Python (amongst other languages). It is meant for testing websites, but it seems it should be usable for scraping as well. (Disclaimer: I have never used it myself.)

我做我的改变 2024-08-27 06:03:55

The Webscraping library wraps the PyQt4 WebView into a simple and easy-to-use API.

Here is a simple example to download a web page rendered by WebKit and extract the title element using XPath (taken from the URL above):

from webscraping import download, xpath
D = download.Download()
# download and cache the Google Code webpage
html = D.get('http://code.google.com/p/webscraping')
# use xpath to extract the project title
print(xpath.get(html, '//div[@id="pname"]/a/span'))

緦唸λ蓇 2024-08-27 06:03:55

You can try spidermonkey:

This Python module allows for the implementation of JavaScript classes, objects and functions in Python, as well as the evaluation and calling of JavaScript scripts and functions. It borrows heavily from Claes Jacobssen's JavaScript Perl module, which in turn is based on Mozilla's PerlConnect Perl binding.

想你只要分分秒秒 2024-08-27 06:03:55

I have not found anything for this. I use a combination of beautifulsoup and custom routines...
