Extract the first paragraph from a Wikipedia article (Python)

Published 2024-10-08 02:38:06

How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn]; 14 March 1879 – 18 April 1955) was a theoretical physicist, philosopher and author who is widely regarded as one of the most influential and iconic scientists and intellectuals of all time. A German-Swiss Nobel laureate, Einstein is often regarded as the father of modern physics.[2] He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".[3]

Comments (10)

北方。的韩爷 2024-10-15 02:38:06

I wrote a Python library that aims to make this very easy. Check it out at Github.

To install it, run

$ pip install wikipedia

Then to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print(wikipedia.summary("Albert Einstein", sentences=2))

prints

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪnʃtaɪn]; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

As far as how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile-friendly versions of Wikipedia articles. To be specific, by passing the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers will parse the Wikitext and return a plain text summary of the article you are requesting, up to and including the entire page text. It also accepts the parameters exchars and exsentences, which, not surprisingly, limit the number of characters and sentences returned by the API.
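
For illustration, here is roughly the kind of request the library ends up making under the hood (a minimal sketch using the requests package; the exact parameters are an assumption, not lifted from the library's source):

import requests

# Illustrative underlying API call; parameter choices are an assumption
params = {
    "action": "query",
    "prop": "extracts",
    "exsectionformat": "plain",
    "explaintext": "",         # plain text instead of HTML
    "exchars": 250,            # cap the extract at ~250 characters
    "titles": "Albert Einstein",
    "format": "json",
}
r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
pages = r.json()["query"]["pages"]
print(next(iter(pages.values()))["extract"])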

煮茶煮酒煮时光 2024-10-15 02:38:06

Some time ago I made two classes for getting Wikipedia articles as plain text. I know they aren't the best solution, but you can adapt them to your needs:

    wikipedia.py
    wiki2plain.py

You can use it like this:

from wikipedia import Wikipedia
from wiki2plain import Wiki2Plain

lang = 'simple'
wiki = Wikipedia(lang)

try:
    raw = wiki.article('Uruguay')
except Exception:  # fetching the article failed
    raw = None

if raw:
    wiki2plain = Wiki2Plain(raw)
    content = wiki2plain.text

难忘№最初的完美 2024-10-15 02:38:06

Wikipedia runs a MediaWiki extension that provides exactly this functionality as an API module. TextExtracts implements action=query&prop=extracts with options to return the first N sentences and/or just the introduction, as HTML or plain text.

Here's the API call you want to make, try it:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&exintro=&exsentences=2&explaintext=&redirects=&formatversion=2

  • action=query&prop=extracts to request this info
  • (ex)sentences=2, (ex)intro=, (ex)plaintext, are parameters to the module (see the first link for its API doc) asking for two sentences from the intro as plain text; leave off the latter for HTML.
  • redirects=(true) so if you ask for "titles=Einstein" you'll get the Albert Einstein page info
  • formatversion=2 for a cleaner format in UTF-8.

There are various libraries that wrap invoking the MediaWiki action API, such as the one in DGund's answer, but it's not too hard to make the API calls yourself.
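
For instance, a minimal sketch of that same call with the requests package (assuming Python 3; the parameters mirror the URL above):

import requests

params = {
    "action": "query",
    "prop": "extracts",
    "titles": "Albert Einstein",
    "exsentences": 2,     # first two sentences
    "exintro": "",        # introduction only
    "explaintext": "",    # plain text; drop this line for HTML
    "redirects": "",      # follow redirects such as "Einstein"
    "formatversion": 2,
    "format": "json",
}
r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
print(r.json()["query"]["pages"][0]["extract"])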

Page info in search results discusses getting this text extract, along with getting a description and lead image for articles.

笑,眼淚并存 2024-10-15 02:38:06

What I did is this:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

article= "Albert Einstein"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.find('div',id="bodyContent").p
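
On Python 3 the same idea might look like this (a rough sketch, assuming the requests and beautifulsoup4 packages instead of the modules above; Wikipedia's markup may have changed since this answer):

import requests
from bs4 import BeautifulSoup

article = "Albert Einstein"
headers = {"User-agent": "Mozilla/5.0"}  # wikipedia still needs this

data = requests.get("https://en.wikipedia.org/wiki/" + article, headers=headers).text
soup = BeautifulSoup(data, "html.parser")
print(soup.find("div", id="bodyContent").p)
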
醉生梦死 2024-10-15 02:38:06

The relatively new REST API has a summary method that is perfect for this use, and does a lot of the things mentioned in the other answers here (e.g. removing wikicode). It even includes an image and geocoordinates if applicable.

Using the lovely requests module and Python 3:

import requests
r = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Amsterdam")
page = r.json()
print(page["extract"]) # Returns 'Amsterdam is the capital and...'
莫言歌 2024-10-15 02:38:06

First, I promise I am not being snarky.

Here's a previous question that might be of use:
Fetch a Wikipedia article with Python

In it, someone suggests using the high-level Wikipedia API, which leads to this question:

Is there a Wikipedia API?

烟沫凡尘 2024-10-15 02:38:06

If you want library suggestions, BeautifulSoup and urllib2 come to mind.
Answered on SO before: Web scraping with Python.

I have tried urllib2 to get a page from Wikipedia, but it returned a 403 (Forbidden). MediaWiki provides an API for Wikipedia that supports various output formats. I haven't used python-wikitools, but it may be worth a try. http://code.google.com/p/python-wikitools/

瑕疵 2024-10-15 02:38:06

As others have said, one approach is to use the wikimedia API and urllib or urllib2. The code fragments below are part of what I used to extract what is called the "lead" section, which has the article abstract and the infobox. This will check if the returned text is a redirect instead of actual content, and also let you skip the infobox if present (in my case I used different code to pull out and format the infobox).

import urllib

contentBaseURL='http://en.wikipedia.org/w/index.php?title='

def getContent(title):
    URL=contentBaseURL+title+'&action=raw&section=0'
    f=urllib.urlopen(URL)
    rawContent=f.read()
    return rawContent

title = 'Albert Einstein'   # example title
rawContent = getContent(title)

infoboxPresent = 0
# Check if a redirect was returned.  If so, go to the redirection target
if rawContent.find('#REDIRECT') == 0:
    rawContent = getFullContent(title)   # helper defined elsewhere in the original code
    # extract the redirection title: everything between '[[' and ']]'
    redirectStart = rawContent.find('#REDIRECT[[')+11
    redirectEnd = rawContent.find(']]', redirectStart)
    redirectTitle = rawContent[redirectStart:redirectEnd]
    print 'redirectTitle is: ', redirectTitle
    rawContent = getContent(redirectTitle)

# Skip the Infobox by counting braces until the opening '{{' is balanced
infoboxStart = rawContent.find("{{Infobox")   # starts at the double {'s before "Infobox"
count = 0
infoboxEnd = 0
for i, char in enumerate(rawContent[infoboxStart:-1]):
    if char == "{": count += 1
    if char == "}":
        count -= 1
        if count == 0:
            infoboxEnd = i+infoboxStart+1
            break

if infoboxEnd != 0:
    rawContent = rawContent[infoboxEnd:]

You'll be getting back the raw text including wiki markup, so you'll need to do some cleanup. If you just want the first paragraph, not the whole first section, look for the first newline character.
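
For example, one rough way to make that cut (assuming, as suggested, that the first paragraph ends at the first newline):

# Keep only the text up to the first newline
firstParagraph = rawContent.strip().split('\n', 1)[0]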

疯了 2024-10-15 02:38:06

Try a combination of urllib to fetch the site and BeautifulSoup or lxml to parse the data.
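
A minimal sketch of that combination (assuming Python 3 and the lxml package; the XPath is an assumption about Wikipedia's markup, not something given in this answer):

from urllib.request import Request, urlopen
from lxml import html

# Wikipedia rejects requests without a User-Agent header
req = Request("https://en.wikipedia.org/wiki/Albert_Einstein",
              headers={"User-Agent": "Mozilla/5.0"})
tree = html.fromstring(urlopen(req).read())
# First paragraph element inside the article body
print(tree.xpath('//div[@id="bodyContent"]//p')[0].text_content())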

り繁华旳梦境 2024-10-15 02:38:06

Try pattern.

pip install pattern

from pattern.web import Wikipedia
article = Wikipedia(language="af").search('Kaapstad', throttle=10)
print article.string