How can I get Wikipedia content using the Wikipedia API?

Posted 2024-12-01 10:47:08


I want to get the first paragraph of a Wikipedia article.

What is the API query to do so?


Comments (12)

静若繁花 2024-12-08 10:47:08


See this section in the MediaWiki API documentation, specifically the part about getting the contents of the page.

Use the sandbox to test API calls.

These are the key parameters.

prop=revisions&rvprop=content&rvsection=0

rvsection=0 specifies that only the lead section should be returned.

See this example.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=pizza

To get the HTML, you can similarly use action=parse.

Note that you'll have to strip out any templates or infoboxes.

Edit: If you want to extract the plain text (without wikilinks, etc.), you can use the TextExtracts API. Use the available parameters there to adjust your output.

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=pizza&explaintext=1&exsectionformat=plain
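
For example, here is a minimal Python sketch of the two queries above (it uses the third-party requests library; the title "pizza" comes from the example URLs):

import requests

API = "https://en.wikipedia.org/w/api.php"

# 1. Wikitext of the lead section via prop=revisions and rvsection=0.
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "rvprop": "content",
    "rvsection": 0,
    "titles": "pizza",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    print(page["revisions"][0]["*"])  # raw wikitext; templates/infoboxes still present

# 2. Plain text via the TextExtracts API, as in the second URL above.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exlimit": 1,
    "explaintext": 1,
    "exsectionformat": "plain",
    "titles": "pizza",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    print(page["extract"])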
想挽留 2024-12-08 10:47:08


See Is there a Wikipedia API just for retrieving the content summary? for other proposed solutions. Here is one that I suggested:

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose. Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow
and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that if you want the first paragraph specifically, you still need to get the first <p> tag from the returned HTML. However, in this API call there are no additional assets like images to parse. If you are satisfied with this introduction summary, you can retrieve the text by running a function like PHP's strip_tags to remove the HTML tags.
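
For illustration, a rough Python sketch of this approach (using the requests library; the title "Stack Overflow" comes from the sample query above, and the regex stand-in for PHP's strip_tags is an assumption, not part of the original answer):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": 1,        # zeroth (lead) section only; exchars / exsentences are alternatives
    "titles": "Stack Overflow",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
page = next(iter(pages.values()))   # the page id key is not known in advance
html = page["extract"]              # intro as HTML, no images or infoboxes

# Crude stand-in for PHP's strip_tags: drop the HTML tags to get plain text.
text = re.sub(r"<[^>]+>", "", html).strip()
print(text)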

梅窗月明清似水 2024-12-08 10:47:08


I do it this way:

https://en.wikipedia.org/w/api.php?action=opensearch&search=bee&limit=1&format=json

The response you get is an array with the data, easy to parse:

[
  "bee",
  [
    "Bee"
  ],
  [
    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."
  ],
  [
    "https://en.wikipedia.org/wiki/Bee"
  ]
]

To get just the first paragraph, limit=1 is what you need.
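
A small Python sketch of the same call (requests library; the search term "bee" is the one from the example above). Depending on the wiki's configuration the descriptions array may come back empty, in which case only the title and URL are useful:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "opensearch",
    "search": "bee",   # the example search term from the URL above
    "limit": 1,
    "format": "json",
}
# The response is a four-element array: [query, [titles], [descriptions], [urls]].
query, titles, descriptions, urls = requests.get(API, params=params).json()
print(titles[0])
print(descriptions[0] if descriptions else "(no description returned)")
print(urls[0])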

从﹋此江山别 2024-12-08 10:47:08


To GET the first paragraph of an article:

https://en.wikipedia.org/w/api.php?action=query&titles=Belgrade&prop=extracts&format=json&exintro=1

I have created short Wikipedia API docs (https://github.com/mudroljub/wikipedia-api-docs) for my own needs. It contains working examples of how to get articles, images, and similar content.

知足的幸福 2024-12-08 10:47:08


If you need to do this for a large number of articles, then instead of querying the website directly, consider downloading a Wikipedia database dump and then accessing it through an API such as JWPL.

困倦 2024-12-08 10:47:08


You can get the introduction of an article on Wikipedia by querying a URL such as https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=java. You just need to parse the JSON response; the result is plain text that has already been cleaned, including removing links and references.
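
As a concrete sketch (requests library; the title "java" is the one from the URL above), the plain text sits under query.pages.<pageid>.extract in the JSON:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": 1,
    "explaintext": 1,
    "titles": "java",   # example title from the query above
}
pages = requests.get(API, params=params).json()["query"]["pages"]
intro = next(iter(pages.values()))["extract"]   # page id key is not known in advance
print(intro)  # plain text with links and references already removed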

如此安好 2024-12-08 10:47:08
<script>
    // Fetch the plain-text intro of a Wikipedia article and show it in #Label11.
    function dowiki(place) {
        var URL = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=';

        URL += "&titles=" + encodeURIComponent(place); // article title
        URL += "&rvprop=content";                      // leftover parameter; not used by prop=extracts
        URL += "&callback=?";                          // "?" makes jQuery issue a JSONP request (cross-origin)
        $.getJSON(URL, function (data) {
            var obj = data.query.pages;          // keyed by page id, which is not known in advance
            var ob = Object.keys(obj)[0];        // take the first (and only) page
            console.log(obj[ob]["extract"]);     // plain-text introduction
            try {
                document.getElementById('Label11').textContent = obj[ob]["extract"];
            }
            catch (err) {
                document.getElementById('Label11').textContent = err.message;
            }
        });
    }
</script>
清醇 2024-12-08 10:47:08


You can use the extract_html field of the summary REST endpoint for this: e.g. https://en.wikipedia.org/api/rest_v1/page/summary/Cat.

Note: This aims to simplify the content a bit by removing most of the pronunciations, which mostly appear in parentheses.
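
For example, a minimal Python sketch (requests library; the title "Cat" is the one from the URL above) that reads both the HTML and the plain-text fields of the summary:

import requests

title = "Cat"   # example title from the URL above
url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + title
summary = requests.get(url).json()

print(summary["extract_html"])  # intro as simplified HTML
print(summary["extract"])       # the same intro as plain text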

祁梦 2024-12-08 10:47:08


You can download the Wikipedia database directly and parse all pages to XML with Wiki Parser, which is a standalone application. The first paragraph is a separate node in the resulting XML.

Alternatively, you can extract the first paragraph from its plain-text output.

飘逸的'云 2024-12-08 10:47:08


You can use jQuery to do that. First, create the URL with the appropriate parameters. Check this link to understand what the parameters mean. Then use the $.ajax() method to retrieve the articles. Note that Wikipedia does not allow cross-origin requests; that's why we are using dataType: 'jsonp' in the request.

// Build the query string; action=opensearch returns [query, [titles], [descriptions], [urls]].
var wikiURL = "https://en.wikipedia.org/w/api.php";
wikiURL += '?' + $.param({
    'action' : 'opensearch',
    'search' : 'your_search_term',   // term to look up
    'prop'   : 'revisions',          // note: prop/rvprop are not used by action=opensearch
    'rvprop' : 'content',
    'format' : 'json',
    'limit'  : 10                    // maximum number of results
});

// dataType: 'jsonp' works around the cross-origin restriction mentioned above.
$.ajax({
    url: wikiURL,
    dataType: 'jsonp',
    success: function(data) {
        console.log(data);           // the opensearch result array
    }
});
撩发小公举 2024-12-08 10:47:08


Here is a program that will dump the French and English Wiktionary and Wikipedia:

import sys
import asyncio
import urllib.parse
from uuid import uuid4

import httpx
import found
from found import nstore
from found import bstore
from loguru import logger as log

try:
    import ujson as json
except ImportError:
    import json


# XXX: https://github.com/Delgan/loguru
log.debug("That's it, beautiful and simple logging!")


async def get(http, url, params=None):
    response = await http.get(url, params=params)
    if response.status_code == 200:
        return response.content

    log.error("http get failed with url and reponse: {} {}", url, response)
    return None



def make_timestamper():
    import time
    start_monotonic = time.monotonic()
    start = time.time()
    loop = asyncio.get_event_loop()

    def timestamp():
        # Wanna be faster than datetime.now().timestamp()
        # approximation of current epoch time.
        out = start + loop.time() - start_monotonic
        out = int(out)
        return out

    return timestamp


async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue


async def wikimedia_html(http, wiki="https://en.wikipedia.org/", title="Apple"):
    # e.g. https://en.wikipedia.org/api/rest_v1/page/html/Apple
    url = "{}/api/rest_v1/page/html/{}".format(wiki, urllib.parse.quote(title))
    out = await get(http, url)
    return wiki, title, out


async def save(tx, data, blob, doc):
    uid = uuid4()
    doc['html'] = await bstore.get_or_create(tx, blob, doc['html'])

    for key, value in doc.items():
        nstore.add(tx, data, uid, key, value)

    return uid


WIKIS = (
    "https://en.wikipedia.org/",
    "https://fr.wikipedia.org/",
    "https://en.wiktionary.org/",
    "https://fr.wiktionary.org/",
)

async def chunks(iterable, size):
    # chunk async generator https://stackoverflow.com/a/22045226
    while True:
        out = list()
        for _ in range(size):
            try:
                item = await iterable.__anext__()
            except StopAsyncIteration:
                yield out
                return
            else:
                out.append(item)
        yield out


async def main():
    # logging
    log.remove()
    log.add(sys.stderr, enqueue=True)

    # singleton
    timestamper = make_timestamper()
    database = await found.open()
    data = nstore.make('data', ('sourcery-data',), 3)
    blob = bstore.make('blob', ('sourcery-blob',))

    async with httpx.AsyncClient() as http:
        for wiki in WIKIS:
            log.info('Getting started with wiki at {}', wiki)
            # Polite limit @ https://en.wikipedia.org/api/rest_v1/
            async for chunk in chunks(wikimedia_titles(http, wiki), 200):
                log.info('iterate')
                coroutines = (wikimedia_html(http, wiki, title) for title in chunk)
                items = await asyncio.gather(*coroutines, return_exceptions=True)
                for item in items:
                    if isinstance(item, Exception):
                        msg = "Failed to fetch html on `{}` with `{}`"
                        log.error(msg, wiki, item)
                        continue
                    wiki, title, html = item
                    if html is None:
                        continue
                    log.debug(
                        "Fetch `{}` at `{}` with length {}",
                        title,
                        wiki,
                        len(html)
                    )

                    doc = dict(
                        wiki=wiki,
                        title=title,
                        html=html,
                        timestamp=timestamper(),
                    )

                    await found.transactional(database, save, data, blob, doc)


if __name__ == "__main__":
    asyncio.run(main())

Another approach to acquiring Wikimedia data is to rely on Kiwix ZIM dumps.

面如桃花 2024-12-08 10:47:08


Suppose keyword = "Batman" (the term you want to search for). Then use:

https://en.wikipedia.org/w/api.php?action=parse&page={{keyword}}&format=json&prop=text&section=0

to get the summary/first paragraph from Wikipedia in JSON format.
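
A short Python sketch of that call (requests library; "Batman" is the example keyword), which returns the rendered HTML of section 0 under parse.text:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "page": "Batman",   # the keyword you want to search for
    "prop": "text",
    "section": 0,
    "format": "json",
}
data = requests.get(API, params=params).json()
html = data["parse"]["text"]["*"]   # HTML of the lead section
print(html)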
