Trying to access the Internet using urllib2 in Python

Posted on 2024-12-25 04:53:49

I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:

import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()

Instead of behaving as expected, the shell just sits there as if it's waiting for input. There isn't even a ">>>" or "..." prompt. The only way to exit this state is with [ctrl]+c. When I do this, I get a whole bunch of error messages, like

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)

I'd appreciate any feedback. Is there a different tool than urllib2 that I should use, or can you give advice on how to fix this? I'm using a networked computer at work, and I'm not entirely sure how the shell is configured or how that might affect anything.

森末i 2025-01-01 04:53:49

With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right http proxy to use, and when it cannot find the right one, it just hangs and eventually times out.
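One quick check, not part of the original answer, is to print the proxy mapping Python detects on its own. The sketch below is only illustrative; the try/except import is an assumption added so that it runs on both Python 2 and Python 3.

```python
try:
    from urllib import getproxies          # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

# getproxies() returns a dict mapping scheme -> proxy URL, built from
# environment variables such as http_proxy or from OS-level settings.
detected = getproxies()

if detected:
    for scheme, proxy_url in sorted(detected.items()):
        print("%s -> %s" % (scheme, proxy_url))
else:
    print("No proxy detected; urlopen will try to connect directly.")
```

If this prints nothing useful on a network that requires a proxy, that would be consistent with the hang described in the question.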

So first you have to find out which proxy should be used. Check your browser's options (in IE: Tools -> Internet Options -> Connections -> LAN settings, and so on). If it uses a script to autoconfigure, you'll have to fetch the script (which should be some sort of JavaScript) and work out where your request is supposed to go. If no script is specified and the "automatically determine" option is ticked, you might as well just ask someone in your company's IT department.

I assume you're using Python 2.x. From the Python docs on urllib:

import urllib

some_url = 'http://www.python.org'  # the page to fetch
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)

Note that letting ProxyHandler work out default proxy values is what already happens when you call urlopen, so relying on those defaults is probably not going to work.

If you really want urllib2, you'll have to specify a ProxyHandler explicitly, like the example on this page. Authentication might or might not be required (usually it's not).
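As a rough illustration of specifying a ProxyHandler, the sketch below builds an opener with an explicit proxy. It is a minimal sketch under stated assumptions, not the example the answer links to. The proxy address is the placeholder from the urllib snippet above, `make_proxy_opener` is a hypothetical helper name, and the try/except import is added only so the code runs on both Python 2 and Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent

# Placeholder proxy address (an assumption); replace it with the real
# host and port from your browser settings or IT department.
PROXY_URL = 'http://www.someproxy.com:3128'

def make_proxy_opener(proxy_url):
    """Return an opener that sends http requests through proxy_url."""
    handler = urllib2.ProxyHandler({'http': proxy_url})
    return urllib2.build_opener(handler)

opener = make_proxy_opener(PROXY_URL)

# Uncomment to fetch the page through the proxy (needs network access):
# html = opener.open('http://www.python.org').read()
```

Calling `urllib2.install_opener(opener)` afterwards should make plain `urllib2.urlopen` calls use the same proxy, which may be more convenient when exploring in an interactive shell.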

负佳期 2025-01-01 04:53:49

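This isn't a good answer to "how do I do this with urllib2", but let me suggest Python Requests. The whole reason it exists is that its author found urllib2 to be an unwieldy mess. He's probably right.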

醉酒的小男人 2025-01-01 04:53:49

That is very weird. Have you tried a different URL?

Otherwise there is httplib, although it is more complicated. Here's your example using httplib:

import httplib as h
domain = h.HTTPConnection('www.python.org')
domain.connect()
domain.request('GET', '/fish.html')
response = domain.getresponse()
if response.status == h.OK:
    html = response.read()

盗琴音 2025-01-01 04:53:49

I get a 404 error almost immediately (no hanging):

>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
  ...
urllib2.HTTPError: HTTP Error 404: Not Found

If I try to contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten the wait by passing the timeout parameter to urlopen:

>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
  ...
urllib2.URLError: <urlopen error timed out>
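To keep a script from stopping with a traceback when this happens, the timeout can be combined with error handling. The following is a minimal sketch rather than a definitive implementation: `fetch_or_none` is a hypothetical helper name, and the try/except import is an assumption added so that it runs on both Python 2 and Python 3. The `timeout` argument to `urlopen` needs Python 2.6 or later; on Python 2.5, which the traceback paths in the question suggest, calling `socket.setdefaulttimeout(5)` before `urlopen` is the usual alternative.

```python
import socket

try:
    import urllib2                          # Python 2
    from urllib2 import URLError
except ImportError:
    import urllib.request as urllib2        # Python 3 equivalent
    from urllib.error import URLError

def fetch_or_none(url, timeout=5):
    """Return the response body, or None if the request fails or times out."""
    try:
        response = urllib2.urlopen(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except (URLError, socket.timeout) as exc:
        # HTTPError (e.g. a 404) is a subclass of URLError, so it lands here too.
        print("Request for %s failed: %s" % (url, exc))
        return None
```

A call such as `fetch_or_none('http://www.python.org', timeout=5)` then either returns the page body or gives up after roughly the timeout instead of hanging indefinitely.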
