Fetch a Wikipedia article with Python


I am trying to fetch a Wikipedia article with Python's urllib:

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following response: Error - Wikimedia Foundation:

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT

Wikipedia seems to block requests that do not come from a standard browser.

Does anybody know how to work around this?


Comments (10)

黯淡〆 2024-07-13 02:39:54

In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you need the HTML code, use 'action=render', as in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content, with something like 'section=3'.

You can then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'd be better off using mwclient, as suggested above.

Refer to MediaWiki's FAQ if you need more information.

小清晰的声音 2024-07-13 02:39:54

The general solution I use for any site is to access the page with Firefox and, using an extension such as Firebug, record all details of the HTTP request, including any cookies.

In your program (in this case, Python) you should then send an HTTP request as similar as possible to the one that worked from Firefox. This often means setting the User-Agent, Referer and Cookie fields, but there may be others.
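
A sketch of replaying such headers with the requests library (all header values below are illustrative placeholders copied from a hypothetical browser session, not values the site mandates):

import requests

# Replicate the headers observed in the browser's request (placeholder values).
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Referer': 'https://en.wikipedia.org/',
    # 'Cookie': 'name=value',  # only needed if the site actually requires one
}
resp = requests.get('https://en.wikipedia.org/w/index.php',
                    params={'title': 'Albert_Einstein', 'printable': 'yes'},
                    headers=headers)
resp.raise_for_status()
html = resp.text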

七度光 2024-07-13 02:39:54

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

悲念泪 2024-07-13 02:39:54

Try changing the User-Agent header you are sending in your request to something like:

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

星星的轨迹 2024-07-13 02:39:54

requests is awesome!

Here is how you can get the HTML content with requests:

import requests

# Fetch the printable page and keep the decoded response body.
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!

潦草背影 2024-07-13 02:39:54

Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Asking the MediaWiki action API to parse the page with action=parse likewise gives you just the body HTML, but is good if you want finer control; see the parse API help.

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case: https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
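
A sketch of fetching that cached HTML from the REST endpoint with requests (the User-Agent value is a placeholder, included only because it is encouraged):

import requests

# Fetch the cached HTML rendering of the article from the REST API (v1).
url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
resp = requests.get(url, headers={'User-Agent': 'MyWikiFetcher/0.1 (contact: me@example.com)'})
resp.raise_for_status()
html = resp.text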

第七度阳光i 2024-07-13 02:39:54

import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without "action=raw" it does not work for me.

╰沐子 2024-07-13 02:39:53

Rather than trying to trick Wikipedia, you should consider using their high-level API.
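
Assuming the API meant here is the MediaWiki action API, a sketch of fetching a page's rendered HTML through it might look like this (the User-Agent value is a placeholder):

import requests

# Ask the MediaWiki action API to parse the article; formatversion=2 returns
# the HTML as a plain string under parse.text.
resp = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={'action': 'parse', 'page': 'Albert Einstein',
            'format': 'json', 'formatversion': '2'},
    headers={'User-Agent': 'MyWikiFetcher/0.1 (contact: me@example.com)'},
)
resp.raise_for_status()
html = resp.json()['parse']['text']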

心清如水 2024-07-13 02:39:53

It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.

I have used it myself for two projects, and it works very well.
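
A sketch against a recent release of mwclient (call names may differ in the older version the answer links to):

import mwclient

# Connect to English Wikipedia and fetch the article wikitext directly,
# with no HTML parsing needed.
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()
print(wikitext[:200])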

风追烟花雨 2024-07-13 02:39:53

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
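
The same idea on Python 3, where urllib2 became urllib.request (a sketch mirroring the snippet above):

import urllib.request

# Install a non-blank User-Agent on the opener, then fetch the page.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()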