Script always gets a 302 response when pulling a random page from Wikipedia
I can pull any page from Wikipedia with
import httplib
conn = httplib.HTTPConnection("en.wikipedia.org")
conn.debuglevel = 1
conn.request("GET","/wiki/Normal_Distribution",headers={'User-Agent':'Python httplib'})
r1 = conn.getresponse()
r1.read()
The normal response will be
reply: 'HTTP/1.0 200 OK\r\n'
header: Date: Sun, 03 Apr 2011 23:49:36 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Content-Language: en
header: Vary: Accept-Encoding,Cookie
header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
header: Content-Length: 263638
header: Content-Type: text/html; charset=UTF-8
header: Age: 1280309
header: X-Cache: HIT from sq77.wikimedia.org
header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
header: X-Cache: MISS from sq66.wikimedia.org
header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
header: Connection: close
But if I try to pull a random page with /wiki/Special:Random, I get a 302 response and an empty page:
reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
header: Date: Mon, 18 Apr 2011 19:25:52 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Vary: Accept-Encoding,Cookie
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
header: Content-Length: 0
header: Content-Type: text/html; charset=utf-8
header: X-Cache: MISS from sq60.wikimedia.org
header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
header: X-Cache: MISS from sq62.wikimedia.org
header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
header: Connection: close
How do I get a non-empty random page?
Comments (4)
The 302 is a redirect. It's telling you where to go in the Location header of the response:
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
You just need to follow the redirect.
When you're redirected, the response object is going to have a code of 302, and the geturl() method will report the redirect URL. Python's standard HTTP libraries make it non-trivial to handle redirects by default. Do yourself a favor: don't hassle with this stuff, and use the third-party mechanize library, which is a drop-in replacement for urllib2. Using mechanize, your code would look like this:
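The mechanize snippet this answer refers to is not included in the text above, and the following is not a reconstruction of it. As a stand-in, here is a minimal sketch that uses only the standard library, assuming Python 3, where `urllib.request` is the successor of `urllib2`. An opener created with `build_opener()` follows a 302 on its own, and `geturl()` then reports the URL of the page that was finally served. The function name `fetch_final_page`, the `opener` parameter, and the `USER_AGENT` value are made up for illustration.

```python
import urllib.request

USER_AGENT = "Python urllib example"  # hypothetical value, pick your own


def fetch_final_page(url, opener=None, timeout=10):
    """Fetch url and return (final_url, status, body) after any redirects."""
    # build_opener() installs HTTPRedirectHandler by default, so a 302 is
    # followed automatically and the response describes the final page.
    if opener is None:
        opener = urllib.request.build_opener()
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with opener.open(request, timeout=timeout) as response:
        return response.geturl(), response.status, response.read()
```

Calling `fetch_final_page("http://en.wikipedia.org/wiki/Special:Random")` should then return the URL of whichever article the server picked, together with its HTML, instead of an empty 302 body. The `opener` parameter exists only so that a different or fake opener can be passed in, for example in tests.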
HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL, and you'll hopefully get a 200 on that page.
To clarify: you are being asked to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page in the Location header. If you look at other 302 responses, I'm sure you'll see a different page in the Location header.
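The "make another request" step described here can be sketched by hand. This is one possible approach, assuming Python 3, where the question's `httplib` module is named `http.client`. The helper names `http_get` and `get_following_redirects`, the `max_hops` limit, and the set of status codes treated as redirects are choices made for this example and do not come from the thread.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def http_get(url, user_agent="Python http.client"):
    """Issue a single GET without redirect handling.

    Returns (status, location_header_or_None, body).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location"), response.read()
    finally:
        conn.close()


def get_following_redirects(url, get=http_get, max_hops=5):
    """Repeat the request at each Location until a non-redirect arrives."""
    for _ in range(max_hops + 1):
        status, location, body = get(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

With this, `get_following_redirects("http://en.wikipedia.org/wiki/Special:Random")` would first receive the 302, read its Location header, and issue a second GET for that article. The hop limit guards against redirect loops, and the `get` parameter lets the single-request function be swapped out.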
Look at the Location header:
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
It says you should be redirected to that page. Read that header and do another request to that page.