Which is best in Python: urllib2, PycURL or mechanize?

OK, so I need to download some web pages using Python, and I did a quick investigation of my options.

Included with Python:

urllib - it seems to me that I should use urllib2 instead; urllib has no cookie support and handles HTTP/FTP/local files only (no SSL)

urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)

Full featured:

mechanize - can use/save Firefox/IE cookies, take actions like following the second link on a page, actively maintained (0.2.5 released in March 2011)

PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)

New possibilities:

urllib3 - supports connection re-using/pooling and file posting

Deprecated (a.k.a. use urllib/urllib2 instead):

httplib - HTTP/HTTPS only (no FTP)

httplib2 - HTTP/HTTPS only (no FTP)

The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and BSDs, so installation is typically a non-issue (so that's good).

urllib2 looks good, but I'm wondering why PycURL and mechanize both seem so popular; is there something I am missing (i.e., if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros/cons of these options so I can make the best choice for myself.
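For reference, here is a minimal sketch of the kind of download in question, using urllib2 with cookielib for cookie handling (Python 2 era APIs; the URL is just a placeholder):

import cookielib
import urllib2

# Cookies set by the server are kept in the jar and sent back
# automatically on later requests made through this opener.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

response = opener.open('http://example.com/')  # placeholder URL
html = response.read()
print response.getcode(), len(html)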

Edit: added note on verb support in urllib2


愿与i 2024-08-30 17:27:41

I think this talk (at PyCon 2009) has the answers you're looking for (Asheesh Laroia has a lot of experience on the matter). He points out the good and the bad of most of the options on your list.

From the PyCon 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update:
Asheesh Laroia has updated his presentation for PyCon 2010:

  • PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it

    * My motto: "The website is the API."
    * Choosing a parser: BeautifulSoup, lxml, HTMLParser, and html5lib.
    * Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
    * Automatic template reverse-engineering tools.
    * Submitting to forms.
    * Playing with XML-RPC
    * DO NOT BECOME AN EVIL COMMENT SPAMMER.
    * Countermeasures, and circumventing them:
          o IP address limits
          o Hidden form fields
          o User-agent detection
          o JavaScript
          o CAPTCHAs 
    * Plenty of full source code to working examples:
          o Submitting to forms for text-to-speech.
          o Downloading music from web stores.
          o Automating Firefox with Selenium RC to navigate a pure-JavaScript service. 
    * Q&A and workshopping
    * Use your power for good, not evil. 
    

Update 2:

PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.
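As a small illustration of the parser-choice topic these talks cover (not code from the talks themselves), here is one way to pull the links out of a page with lxml and XPath; the URL is a placeholder:

import urllib2
from lxml import html

page = urllib2.urlopen('http://example.com/').read()  # placeholder URL
tree = html.fromstring(page)

# lxml tolerates fairly messy HTML; the XPath selects every link element.
for link in tree.xpath('//a'):
    print link.get('href'), link.text_content().strip()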

月牙弯弯 2024-08-30 17:27:41

Python requests is also a good candidate for HTTP stuff. It has a nicer API IMHO; here is an example HTTP request from its official documentation:

>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.content
...
羞稚 2024-08-30 17:27:41

  • urllib2 is found in every Python install everywhere, so is a good base upon which to start.
  • PycURL is useful for people already used to libcurl; it exposes more of the low-level details of HTTP, and it picks up any fixes or improvements applied to libcurl.
  • mechanize is used to persistently drive a connection much like a browser would.

It's not a matter of one being better than the other, it's a matter of choosing the appropriate tool for the job.
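To make the mechanize point concrete, here is a minimal sketch (not taken from the answer) of its stateful, browser-like Browser object; the URL and form details are placeholders:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # skip robots.txt for this sketch
br.addheaders = [('User-Agent', 'Mozilla/5.0')]

br.open('http://example.com/login')  # placeholder URL
# br.select_form(nr=0)               # pick the first form on the page
# br['username'] = 'me'              # form fields are set like dict entries
# br.submit()                        # cookies persist across later br.open() calls
html = br.response().read()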

双手揣兜 2024-08-30 17:27:41

To "get some webpages", use requests!

From http://docs.python-requests.org/en/latest/ :

Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
冬天旳寂寞 2024-08-30 17:27:41

Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

urllib2 is best (as it's inbuilt), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods etc. Using Firefox cookies means you can get things from sites (like say StackOverflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).

PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.
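A rough sketch of that drop-in idea: mechanize mirrors the urllib2 interface and can load cookies from an exported Mozilla-format cookies.txt (the file name and URL below are placeholders):

import mechanize

# Cookies exported from Firefox in the classic Mozilla cookies.txt format.
cj = mechanize.MozillaCookieJar()
cj.load('cookies.txt', ignore_discard=True, ignore_expires=True)

# Same build_opener/open calls as urllib2, so existing code ports almost unchanged.
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
html = opener.open('http://example.com/').read()  # placeholder URL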

辞旧 2024-08-30 17:27:41

urllib2 only supports HTTP GET and POST. There might be workarounds, but if your app depends on other HTTP verbs, you will probably prefer a different module.
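One common workaround (an assumption on my part, not something the answer spells out) is to override get_method on a urllib2.Request so the opener sends a different verb; the URL is a placeholder:

import urllib2

class RequestWithMethod(urllib2.Request):
    # A Request that reports an explicit HTTP verb instead of GET/POST.
    def __init__(self, url, method, *args, **kwargs):
        self._method = method
        urllib2.Request.__init__(self, url, *args, **kwargs)

    def get_method(self):
        return self._method

req = RequestWithMethod('http://example.com/resource/1', 'DELETE')
response = urllib2.urlopen(req)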

岛徒 2024-08-30 17:27:41

Every Python library that speaks HTTP has its own advantages.

Use the one that has the minimum amount of features necessary for a particular task.

Your list is missing at least urllib3 - a cool third-party HTTP library which can reuse an HTTP connection, greatly speeding up the process of retrieving multiple URLs from the same site.
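A minimal sketch of that pooling behaviour (the URLs are placeholders): repeated requests to the same host are served over a reused connection.

import urllib3

http = urllib3.PoolManager()

# All three requests hit the same host, so the pool can reuse one connection.
for path in ('/a', '/b', '/c'):
    r = http.request('GET', 'http://example.com' + path)
    print r.status, len(r.data)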

只为一人 2024-08-30 17:27:41

Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces:
1) Grab for creating network requests and parsing retrieved data
2) Spider for creating bulk site scrapers

Under the hood Grab uses pycurl and lxml, but it is possible to use other network transports (for example, the requests library). The requests transport is not well tested yet.
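A very rough sketch based on Grab's documentation (the method and attribute names are taken on trust from its docs rather than verified here, and the URL is a placeholder):

from grab import Grab

g = Grab()
resp = g.go('http://example.com/')  # placeholder URL
print resp.code, len(resp.body)     # HTTP status and response body size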
