Script always receives a 302 response when fetching a random page from Wikipedia

Posted on 2024-11-02 17:04:14

I can pull any page from Wikipedia with:

import httplib
conn = httplib.HTTPConnection("en.wikipedia.org")
conn.debuglevel = 1
conn.request("GET","/wiki/Normal_Distribution",headers={'User-Agent':'Python httplib'})
r1 = conn.getresponse()
r1.read()

The normal response will be

reply: 'HTTP/1.0 200 OK\r\n'
header: Date: Sun, 03 Apr 2011 23:49:36 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Content-Language: en
header: Vary: Accept-Encoding,Cookie
header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
header: Content-Length: 263638
header: Content-Type: text/html; charset=UTF-8
header: Age: 1280309
header: X-Cache: HIT from sq77.wikimedia.org
header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
header: X-Cache: MISS from sq66.wikimedia.org
header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
header: Connection: close

But if I try to pull a random page with /wiki/Special:Random, I get a 302 response and an empty page:

reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
header: Date: Mon, 18 Apr 2011 19:25:52 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Vary: Accept-Encoding,Cookie
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
header: Content-Length: 0
header: Content-Type: text/html; charset=utf-8
header: X-Cache: MISS from sq60.wikimedia.org
header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
header: X-Cache: MISS from sq62.wikimedia.org
header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
header: Connection: close

How do I get a non-empty random page?

Comments (4)

情绪 2024-11-09 17:04:14

The 302 is a redirect. It's telling you where to go in the following line:

header: Location: http://en.wikipedia.org/wiki/tuticorin_port_trust 

You just need to follow the redirect.
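
Following the redirect by hand could look roughly like the sketch below. It is written for Python 3, where httplib is named http.client, and the helper names redirect_target and fetch are made up for this illustration. It assumes a plain GET with no cookies or authentication, and it resolves a relative Location against the current URL.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url, status, location):
    """Return the absolute URL to request next, or None if there is no redirect."""
    if status in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the current URL.
        return urljoin(current_url, location)
    return None


def fetch(url, max_redirects=5, user_agent="Python http.client"):
    """GET url, following redirects by hand; returns (final_url, status, body)."""
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path, headers={"User-Agent": user_agent})
            response = conn.getresponse()
            body = response.read()
            target = redirect_target(url, response.status,
                                     response.getheader("Location"))
        finally:
            conn.close()
        if target is None:
            # Not a redirect: this is the page we wanted.
            return url, response.status, body
        url = target
    raise RuntimeError("too many redirects")
```

Calling fetch("http://en.wikipedia.org/wiki/Special:Random") should then return the final article URL, its status and the page body. The max_redirects limit is only a guard against redirect loops; 5 is an arbitrary choice.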

蓝天白云 2024-11-09 17:04:14

When you're redirected, an httplib response will have a status code of 302, because httplib does not follow redirects for you, and handling them by hand is non-trivial. Do yourself a favor: don't hassle with this stuff and use the third-party mechanize library, which is a drop-in replacement for urllib2. Its response's geturl() method reports the URL you were redirected to.

Using mechanize, your code would look like this:

import httplib
import mechanize

host = 'en.wikipedia.org'
path = '/wiki/Special:Random'
url = 'http://' + host + path # We have to pass a http:// url

# It still uses httplib.HTTPConnection, so we can debug
httplib.HTTPConnection.debuglevel = 1

request = mechanize.Request(url, headers={'User-Agent': 'Python-mechanize'}) 
response = mechanize.urlopen(request)

print response.code
# => 200
print response.geturl()
# => 'http://en.wikipedia.org/wiki/Faliszowice,_Lesser_Poland_Voivodeship'
data = response.read()
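
If you would rather not install a third-party package, the standard library's urllib also follows redirects by default. Below is a minimal Python 3 sketch using urllib.request. The function names, the User-Agent string and the https:// scheme are assumptions made for this example, not something taken from the code above.

```python
import urllib.request


def build_request(host, path, user_agent="Python-urllib-example"):
    """Build a GET request for https://<host><path> with a User-Agent header."""
    url = "https://" + host + path  # assumption: the site is reachable over https
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_random_page(host="en.wikipedia.org", path="/wiki/Special:Random"):
    """Return (status, final_url, body); urlopen follows the 302 automatically."""
    request = build_request(host, path)
    with urllib.request.urlopen(request) as response:
        # geturl() reports the URL of the page that was actually returned.
        return response.status, response.geturl(), response.read()
```

Calling fetch_random_page() should give a 200 status together with the URL of whichever article the redirect pointed to.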

表情可笑 2024-11-09 17:04:14

HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL and you'll hopefully get a 200 on that page.

To clarify: You are being requested to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page as the Location field. If you look at other 302 responses, I'm sure you'll see a different page in the Location field.
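
To make that mechanism concrete, here is a toy Python 3 handler that imitates it. This is only a guess at the general shape, not how MediaWiki is actually implemented. The PAGES list stands in for the database, and pick_location and RandomRedirectHandler are names invented for this sketch.

```python
import random
from http.server import BaseHTTPRequestHandler

# Hypothetical page titles standing in for the wiki's database.
PAGES = [
    "Tuticorin_Port_Trust",
    "Normal_Distribution",
    "Faliszowice,_Lesser_Poland_Voivodeship",
]


def pick_location(pages, rng=random):
    """Choose a random page and return the path to redirect to."""
    return "/wiki/" + rng.choice(pages)


class RandomRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/wiki/Special:Random":
            # Empty body: the client is expected to request Location next.
            self.send_response(302)
            self.send_header("Location", pick_location(PAGES))
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            # Any other path is answered directly with a 200.
            body = ("Page: " + self.path).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
```

To try it locally, you could serve it with http.server.HTTPServer(("127.0.0.1", 8000), RandomRedirectHandler).serve_forever() (the port is arbitrary). Repeated requests for /wiki/Special:Random should then come back as 302 responses with varying Location values and an empty body, just like the output in the question.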

瘫痪情歌 2024-11-09 17:04:14

Look at the Location header:

header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust

It says you should be redirected to that page. Read that header and make another request to that page.
