Trying to access the Internet using urllib2 in Python
I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:
    import urllib2
    response = urllib2.urlopen('http://www.python.org')
    html = response.read()
Instead of acting in any expected way, the shell just sits there, as if it's waiting for some input. There isn't even a ">>>" or "..." prompt. The only way to exit this state is with [ctrl]+c. When I do this, I get a whole bunch of error messages, like
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
        return _opener.open(url, data)
      File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
        response = self._open(req, data)
I'd appreciate any feedback. Is there a different tool than urllib2 to use, or can you give advice on how to fix this? I'm using a network computer at my work, and I'm not entirely sure how the shell is configured or how that might affect anything.
Comments (4)
With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right HTTP proxy to use, and when it cannot find the right one, the request just hangs and eventually times out.
So first you have to find out which proxy should be used: check the options of your browser (Tools -> Internet Options -> Connections -> LAN settings... in IE, etc). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of JavaScript) and find out where your request is supposed to go. If there is no script specified and the "automatically determine" option is ticked, you might as well just ask someone in IT at your company.
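Before digging through browser settings, it can help to see what Python itself has detected. This is only a minimal sketch, assuming a Linux machine like the one in the question, where the standard library picks up proxies from environment variables such as http_proxy. The helper name detected_proxies is made up for illustration.

```python
import os

try:
    from urllib import getproxies          # Python 2, as in the question
except ImportError:
    from urllib.request import getproxies  # Python 3 equivalent


def detected_proxies():
    """Return (proxies the standard library would use, proxy-related environment variables)."""
    from_python = getproxies()
    from_env = dict(
        (name, value)
        for name, value in os.environ.items()
        if name.lower().endswith("_proxy")
    )
    return from_python, from_env
```

Calling detected_proxies() in the interactive shell shows both views side by side. If the first dictionary is empty, Python found no proxy and urlopen will attempt a direct connection, which a company firewall may simply never answer.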
I assume you're using Python 2.x; see the Python docs on urllib. Note that ProxyHandler figuring out default values is what already happens when you use urlopen, so relying on those defaults is probably not going to work.
If you really want urllib2, you'll have to specify a ProxyHandler explicitly, like the example in this page. Authentication might or might not be required (usually it's not).
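To make the explicit ProxyHandler route concrete, here is a minimal sketch. The proxy address proxy.example.com:3128 is a placeholder and not something from the question; substitute whatever your browser settings or IT department give you. The helper name build_proxy_opener is also made up for illustration.

```python
try:
    import urllib2                      # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent

# Placeholder address (an assumption): replace with your real proxy.
PROXY_URL = "http://proxy.example.com:3128"


def build_proxy_opener(proxy_url=PROXY_URL):
    """Build an opener that sends plain HTTP requests through proxy_url."""
    handler = urllib2.ProxyHandler({"http": proxy_url})
    return urllib2.build_opener(handler)


# Usage (performs a real network request, so it is left commented out):
# opener = build_proxy_opener()
# html = opener.open("http://www.python.org").read()
```

If you would rather keep calling urllib2.urlopen directly, passing the opener to urllib2.install_opener() makes it the default for later calls.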
This isn't a good answer to "How to do this with urllib2", but let me suggest python-requests. The whole reason it exists is that its author found urllib2 to be an unwieldy mess, and he's probably right.
That is very weird. Have you tried a different URL?
Otherwise there is httplib, although it is more complicated. Here's your example using httplib
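A minimal sketch of what the same fetch could look like with httplib (renamed http.client in Python 3). The function names are made up for illustration, and the timeout argument assumes Python 2.6 or later; the Python 2.5 install in the question does not accept it. Keep in mind that httplib connects directly to the host and does not consult any proxy settings, so on its own it would not get past a proxy problem.

```python
try:
    import httplib                    # Python 2, as in the question
except ImportError:
    import http.client as httplib     # Python 3 name for the same module


def open_connection(host, timeout=10):
    """Create the connection object; nothing is sent until a request is made."""
    return httplib.HTTPConnection(host, timeout=timeout)


def fetch(conn, path="/"):
    """Send a GET request over conn and return (status code, body)."""
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        # Always release the socket, even if the request failed.
        conn.close()
```

In the shell this would be used as status, html = fetch(open_connection("www.python.org")).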
I get a 404 error almost immediately (no hanging):
If I try to contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten the wait by passing the timeout parameter to urlopen:
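A minimal sketch of passing a timeout, with the failure cases caught so the shell reports an error instead of sitting silent. It assumes Python 2.6 or later, where urlopen accepts timeout; on the Python 2.5 install from the question, socket.setdefaulttimeout() is the usual substitute. The fetch helper and its opener argument are made up for illustration.

```python
import socket

try:
    from urllib2 import urlopen, URLError      # Python 2.6+
except ImportError:
    from urllib.request import urlopen         # Python 3
    from urllib.error import URLError


def fetch(url, timeout=5.0, opener=urlopen):
    """Return (body, None) on success or (None, error message) on failure.

    opener is injectable only so the function can be exercised without a network.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read(), None
        finally:
            response.close()
    except (URLError, socket.timeout) as exc:
        # URLError covers refused connections, DNS failures and HTTP errors
        # such as 404 (HTTPError is a subclass); socket.timeout covers a stalled read.
        return None, str(exc)
```

With this, html, error = fetch('http://www.python.org', timeout=5) gives up after roughly five seconds of silence rather than waiting for the system default.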