Which is best in Python: urllib2, PycURL or mechanize?
Ok so I need to download some web pages using Python and did a quick investigation of my options.
Included with Python:
urllib - seems to me that I should use urllib2 instead. urllib has no cookie support, HTTP/FTP/local files only (no SSL)
urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)
Full featured:
mechanize - can use/save Firefox/IE cookies, take actions like following the second link on a page, actively maintained (0.2.5 released in March 2011)
PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)
New possibilities:
urllib3 - supports connection re-using/pooling and file posting
Deprecated (a.k.a. use urllib/urllib2 instead):
httplib - HTTP/HTTPS only (no FTP)
httplib2 - HTTP/HTTPS only (no FTP)
The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and BSDs, so installation is typically a non-issue (so that's good).
urllib2 looks good, but I'm wondering why PycURL and mechanize both seem very popular. Is there something I am missing (i.e. if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros and cons of these options so I can make the best choice for myself.
Edit: added note on verb support in urllib2
8 Answers
I think this talk (from PyCon 2009) has the answers to what you're looking for (Asheesh Laroia has a lot of experience on the matter). He points out the good and the bad of most of your list:
Scrape the Web: Strategies for programming websites that don't expect it (Part 1 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 2 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 3 of 3)
From the PyCon 2009 schedule:
Update:
Asheesh Laroia has updated his presentation for PyCon 2010:
PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it
Update 2:
PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it
Python requests is also a good candidate for HTTP stuff. It has a nicer API, IMHO. An example HTTP request from their official documentation:
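A minimal sketch in the spirit of the examples in the requests documentation (the URL and the user/pass credentials below are placeholders, not taken from the docs):

    import requests

    # GET with HTTP Basic Auth; URL and credentials are placeholders
    r = requests.get('https://api.github.com/user', auth=('user', 'pass'))

    print r.status_code               # e.g. 200
    print r.headers['content-type']   # e.g. 'application/json; charset=utf8'
    print r.text                      # response body as unicode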
urllib2 is found in every Python install everywhere, so it is a good base upon which to start.
PycURL is useful for people already used to using libcurl; it exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
mechanize is used to persistently drive a connection much like a browser would.
It's not a matter of one being better than the other, it's a matter of choosing the appropriate tool for the job.
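To make that concrete, here is a minimal Python 2 sketch of the same fetch done with urllib2 and with mechanize's browser-style interface (example.com is a placeholder URL):

    import urllib2
    import mechanize

    # urllib2: one-shot fetch; you handle cookies and state yourself
    html = urllib2.urlopen('http://example.com').read()

    # mechanize: a stateful Browser that keeps cookies and history,
    # so you can navigate much like a user clicking through pages
    br = mechanize.Browser()
    br.open('http://example.com')
    br.follow_link(nr=0)              # follow the first link on the page
    followed_html = br.response().read()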
To "get some webpages", use requests!
From http://docs.python-requests.org/en/latest/ :
Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)
urllib2 is best (as it's built in), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods, etc. Using Firefox cookies means you can get things from sites (like, say, StackOverflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).
PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.
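A minimal sketch of that setup, assuming the Firefox cookies have been exported to a Netscape-format cookies.txt file (recent Firefox versions keep cookies in SQLite, so an export step is needed):

    import cookielib
    import mechanize

    # Load cookies exported from Firefox (Netscape/Mozilla cookies.txt format)
    cj = cookielib.MozillaCookieJar()
    cj.load('cookies.txt')            # path to the exported cookie file

    br = mechanize.Browser()
    br.set_cookiejar(cj)              # requests now carry your Firefox session

    # Same open/read pattern you would use with urllib2.urlopen
    html = br.open('https://stackoverflow.com/').read()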
urllib2 only supports HTTP GET and POST. There might be workarounds, but if your app depends on other HTTP verbs, you will probably prefer a different module.
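One common workaround is to subclass urllib2.Request and override get_method(), which is what urllib2 consults when choosing the verb. A sketch (the class name RequestWithMethod is just illustrative):

    import urllib2

    class RequestWithMethod(urllib2.Request):
        """urllib2.Request that lets you force an arbitrary HTTP verb."""
        def __init__(self, url, method, **kwargs):
            self._method = method
            urllib2.Request.__init__(self, url, **kwargs)

        def get_method(self):
            # urllib2 calls this to decide the verb (normally GET or POST)
            return self._method

    # Example: send a DELETE request
    req = RequestWithMethod('http://example.com/resource/1', 'DELETE')
    resp = urllib2.urlopen(req)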
Every Python library that speaks HTTP has its own advantages.
Use the one that has the minimum number of features necessary for a particular task.
Your list is missing at least urllib3 - a cool third-party HTTP library which can reuse an HTTP connection, thus greatly speeding up the process of retrieving multiple URLs from the same site.
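A minimal sketch of that pooling behaviour; a single PoolManager keeps the connection to the host open across requests (example.com is a placeholder):

    import urllib3

    # One PoolManager reuses connections to the same host
    http = urllib3.PoolManager()

    for path in ('/page1', '/page2', '/page3'):
        r = http.request('GET', 'http://example.com' + path)
        print r.status, len(r.data)   # status code and body size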
Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces:
1) Grab for creating network requests and parsing retrieved data
2) Spider for creating bulk site scrapers
Under the hood Grab uses pycurl and lxml, but it is also possible to use other network transports (for example, the requests library). The requests transport is not well tested yet.