Which is best in Python: urllib2, PycURL, or mechanize?

Posted on 2024-08-23 17:27:41

OK, so I need to download some web pages using Python, and I did a quick investigation of my options.

Included with Python:

urllib - it seems to me that I should use urllib2 instead. urllib has no cookie support and handles HTTP/FTP/local files only (no SSL)

urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)

Full featured:

mechanize - can use/save Firefox/IE cookies, take actions like follow second link, actively maintained (0.2.5 released in March 2011)

PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)

New possibilities:

urllib3 - supports connection re-using/pooling and file posting

Deprecated (a.k.a. use urllib/urllib2 instead):

httplib - HTTP/HTTPS only (no FTP)

httplib2 - HTTP/HTTPS only (no FTP)

The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and BSDs, so installation is typically a non-issue (so that's good).

urllib2 looks good, but I'm wondering why PycURL and mechanize both seem very popular. Is there something I am missing (i.e. if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros/cons of these things so I can make the best choice for myself.

Edit: added note on verb support in urllib2
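
For readers on Python 3, where urllib2 was folded into urllib.request, here is a minimal sketch of the cookie support mentioned in the list above. The helper names (make_cookie_opener, fetch) and the User-Agent string are made up for illustration; treat it as a starting point under those assumptions rather than a recommended setup.

```python
import http.cookiejar
import urllib.request


def make_cookie_opener():
    """Return (opener, jar): an opener that stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Optional: the User-Agent value is a placeholder, not a real product name.
    opener.addheaders = [("User-Agent", "example-fetcher/0.1")]
    return opener, jar


def fetch(opener, url, timeout=10):
    """Download url through the opener and return the body as bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch(opener, some_url) repeatedly with the same opener sends back any cookies the server set on earlier responses, which is roughly the level of cookie handling the question attributes to urllib2.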

Comments (8)

愿与i 2024-08-30 17:27:41

I think this talk (at PyCon 2009) has the answers you're looking for (Asheesh Laroia has lots of experience on the matter). He points out the good and the bad of most of the items on your list.

From the PYCON 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update:
Asheesh Laroia has updated his presentation for PyCon 2010:

  • PyCon 2010: Scrape the Web:
    Strategies for programming websites
    that don't expect it

    * My motto: "The website is the API."
    * Choosing a parser: BeautifulSoup, lxml, HTMLParse, and html5lib.
    * Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
    * Automatic template reverse-engineering tools.
    * Submitting to forms.
    * Playing with XML-RPC
    * DO NOT BECOME AN EVIL COMMENT SPAMMER.
    * Countermeasures, and circumventing them:
          o IP address limits
          o Hidden form fields
          o User-agent detection
          o JavaScript
          o CAPTCHAs 
    * Plenty of full source code to working examples:
          o Submitting to forms for text-to-speech.
          o Downloading music from web stores.
          o Automating Firefox with Selenium RC to navigate a pure-JavaScript service. 
    * Q&A and workshopping
    * Use your power for good, not evil. 
    

Update 2:

PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

月牙弯弯 2024-08-30 17:27:41

Python requests is also a good candidate for HTTP stuff. It has a nicer API IMHO. Here is an example HTTP request from their official documentation:

>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.content
...

羞稚 2024-08-30 17:27:41

  • urllib2 is found in every Python install everywhere, so is a good base upon which to start.
  • PycURL is useful for people already used to using libcurl, exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
  • mechanize is used to persistently drive a connection much like a browser would.

It's not a matter of one being better than the other; it's a matter of choosing the appropriate tool for the job.

双手揣兜 2024-08-30 17:27:41

To "get some webpages", use requests!

From http://docs.python-requests.org/en/latest/ :

Python’s standard urllib2 module provides most of the HTTP
capabilities you need, but the API is thoroughly broken. It was built
for a different time — and a different web. It requires an enormous
amount of work (even method overrides) to perform the simplest of
tasks.

Things shouldn’t be this way. Not in Python.

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

冬天旳寂寞 2024-08-30 17:27:41

Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

urllib2 is best (as it's built in); switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2, since they have similar methods, etc. Using Firefox cookies means you can get things from sites (like, say, StackOverflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).

PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.

辞旧 2024-08-30 17:27:41

urllib2 only supports HTTP GET and POST. There might be workarounds, but if your app depends on other HTTP verbs, you will probably prefer a different module.
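
For what it's worth, here is a minimal sketch of one such workaround, assuming Python 3, where urllib2 lives on as urllib.request and Request accepts a method argument (3.3 and later). The build_request helper and the example.com URL are made up for illustration, and nothing is sent over the network.

```python
import urllib.request


def build_request(url, method, body=None):
    """Build (but do not send) a request that uses an arbitrary HTTP verb."""
    # On Python 2's urllib2 there was no method argument; the usual trick
    # was to override Request.get_method to return the verb you wanted.
    return urllib.request.Request(url, data=body, method=method)


# Hypothetical URL, used only to show which verb would be sent.
req = build_request("http://example.com/resource/1", "DELETE")
print(req.get_method())  # prints "DELETE"
```

Passing the result to urllib.request.urlopen(req) would actually send it. Without the method argument, get_method() falls back to GET, or POST when a body is supplied, which is the behaviour described above.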

岛徒 2024-08-30 17:27:41

Every Python library that speaks HTTP has its own advantages.

Use the one that has the minimum amount of features necessary for a particular task.

Your list is missing at least urllib3 - a cool third-party HTTP library that can reuse an HTTP connection, which greatly speeds up retrieving multiple URLs from the same site.
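
To make the connection-reuse idea concrete, here is a rough sketch using only the standard library's http.client (Python 3). This is not urllib3 itself, which adds pooling across hosts on top of the same idea. The fetch_many helper and its parameters are made up for illustration.

```python
import http.client


def fetch_many(host, paths, port=80, timeout=10):
    """Fetch several paths from one host over a single kept-alive connection.

    Returns a list of (status, body_bytes) tuples in the order of paths.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read in full before the socket can be reused
            # for the next request.
            results.append((response.status, response.read()))
    finally:
        conn.close()
    return results
```

If the server keeps the connection open, every request after the first skips the TCP handshake. If the server closes it, http.client reconnects on the next request, so the function still works but loses the speed-up.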

只为一人 2024-08-30 17:27:41

Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces:
1) Grab for creating network requests and parsing retrieved data
2) Spider for creating bulk site scrapers

Under the hood, Grab uses pycurl and lxml, but it is possible to use other network transports (for example, the requests library). The requests transport is not well tested yet.
