Fetch a Wikipedia article with Python


I am trying to fetch a Wikipedia article with Python's urllib:

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following response: Error - Wikimedia Foundation:

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT

Wikipedia seems to block requests that do not come from a standard browser.

Does anybody know how to work around this?


Comments (10)

黯淡〆 2024-07-13 02:39:54

In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you need the HTML code, use 'action=render', as in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content, with something like 'section=3'.

You can then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'd be better off using mwclient, as suggested above.

Refer to MediaWiki's FAQ if you need more information.

小清晰的声音 2024-07-13 02:39:54

The general solution I use for any site is to access the page with Firefox and, using an extension such as Firebug, record all details of the HTTP request, including any cookies.

In your program (in this case, Python) you should then send an HTTP request as similar as possible to the one that worked from Firefox. This often means setting the User-Agent, Referer and Cookie fields, but there may be others.
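
A sketch of replaying such headers with the requests library (all header values below are illustrative placeholders copied from a hypothetical browser session, not values the site mandates):

import requests

# Replicate the headers observed in the browser's request (placeholder values).
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
    'Referer': 'https://en.wikipedia.org/',
    # 'Cookie': 'name=value',  # only needed if the site actually requires one
}
resp = requests.get('https://en.wikipedia.org/w/index.php',
                    params={'title': 'Albert_Einstein', 'printable': 'yes'},
                    headers=headers)
resp.raise_for_status()
html = resp.text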

七度光 2024-07-13 02:39:54

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

悲念泪 2024-07-13 02:39:54

Try changing the User-Agent header you are sending in your request to something like:

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

星星的轨迹 2024-07-13 02:39:54

requests is awesome!

Here is how you can get the HTML content with requests:

import requests

# Fetch the printable page and keep the decoded response body.
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!

潦草背影 2024-07-13 02:39:54

Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Asking the MediaWiki action API to parse the page with action=parse likewise gives you just the body HTML, but is good if you want finer control; see the parse API help.

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case: https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
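
A sketch of fetching that cached HTML from the REST endpoint with requests (the User-Agent value is a placeholder, included only because it is encouraged):

import requests

# Fetch the cached HTML rendering of the article from the REST API (v1).
url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
resp = requests.get(url, headers={'User-Agent': 'MyWikiFetcher/0.1 (contact: me@example.com)'})
resp.raise_for_status()
html = resp.text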

第七度阳光i 2024-07-13 02:39:54

import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without "action=raw" it does not work for me.

╰沐子 2024-07-13 02:39:53

Rather than trying to trick Wikipedia, you should consider using their high-level API.
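
Assuming the API meant here is the MediaWiki action API, a sketch of fetching a page's rendered HTML through it might look like this (the User-Agent value is a placeholder):

import requests

# Ask the MediaWiki action API to parse the article; formatversion=2 returns
# the HTML as a plain string under parse.text.
resp = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={'action': 'parse', 'page': 'Albert Einstein',
            'format': 'json', 'formatversion': '2'},
    headers={'User-Agent': 'MyWikiFetcher/0.1 (contact: me@example.com)'},
)
resp.raise_for_status()
html = resp.json()['parse']['text']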

心清如水 2024-07-13 02:39:53

It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.

I have used it myself for two projects, and it works very well.
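
A sketch against a recent release of mwclient (call names may differ in the older version the answer links to):

import mwclient

# Connect to English Wikipedia and fetch the article wikitext directly,
# with no HTML parsing needed.
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()
print(wikitext[:200])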

风追烟花雨 2024-07-13 02:39:53

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
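
The same idea on Python 3, where urllib2 became urllib.request (a sketch mirroring the snippet above):

import urllib.request

# Install a non-blank User-Agent on the opener, then fetch the page.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()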