How can I get Wikipedia content using the Wikipedia API?

Posted 2024-12-01 10:47:08


I want to get the first paragraph of a Wikipedia article.

What is the API query to do so?


Comments (12)

静若繁花 2024-12-08 10:47:08


See this section in the MediaWiki API documentation, specifically the part about getting the contents of the page.

Use the sandbox to test API calls.

These are the key parameters.

prop=revisions&rvprop=content&rvsection=0

rvsection=0 specifies that only the lead section should be returned.

See this example.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=pizza

To get the HTML, you can similarly use action=parse.

Note that you'll have to strip out any templates or infoboxes.

Edit: If you want to extract the plain text (without wikilinks, etc.), you can use the TextExtracts API. Use the available parameters there to adjust your output.

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=pizza&explaintext=1&exsectionformat=plain
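
For example, here is a minimal Python sketch of the two queries above (it uses the third-party requests library; the title "pizza" comes from the example URLs):

import requests

API = "https://en.wikipedia.org/w/api.php"

# 1. Wikitext of the lead section via prop=revisions and rvsection=0.
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "rvprop": "content",
    "rvsection": 0,
    "titles": "pizza",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    print(page["revisions"][0]["*"])  # raw wikitext; templates/infoboxes still present

# 2. Plain text via the TextExtracts API, as in the second URL above.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exlimit": 1,
    "explaintext": 1,
    "exsectionformat": "plain",
    "titles": "pizza",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
for page in pages.values():
    print(page["extract"])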
想挽留 2024-12-08 10:47:08


See Is there a Wikipedia API just for retrieving the content summary? for other proposed solutions. Here is one that I suggested:

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose. Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts with finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow
and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that if you want the first paragraph specifically, you still need to get the first <p> tag from the returned HTML. However, in this API call there are no additional assets like images to parse. If you are satisfied with this introduction summary, you can retrieve the text by running a function like PHP's strip_tags to remove the HTML tags.
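
For illustration, a rough Python sketch of this approach (using the requests library; the title "Stack Overflow" comes from the sample query above, and the regex stand-in for PHP's strip_tags is an assumption, not part of the original answer):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": 1,        # zeroth (lead) section only; exchars / exsentences are alternatives
    "titles": "Stack Overflow",
}
pages = requests.get(API, params=params).json()["query"]["pages"]
page = next(iter(pages.values()))   # the page id key is not known in advance
html = page["extract"]              # intro as HTML, no images or infoboxes

# Crude stand-in for PHP's strip_tags: drop the HTML tags to get plain text.
text = re.sub(r"<[^>]+>", "", html).strip()
print(text)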

梅窗月明清似水 2024-12-08 10:47:08


I do it this way:

https://en.wikipedia.org/w/api.php?action=opensearch&search=bee&limit=1&format=json

The response you get is an array with the data, easy to parse:

[
  "bee",
  [
    "Bee"
  ],
  [
    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."
  ],
  [
    "https://en.wikipedia.org/wiki/Bee"
  ]
]

To get just the first paragraph, limit=1 is what you need.
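
A small Python sketch of the same call (requests library; the search term "bee" is the one from the example above). Depending on the wiki's configuration the descriptions array may come back empty, in which case only the title and URL are useful:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "opensearch",
    "search": "bee",   # the example search term from the URL above
    "limit": 1,
    "format": "json",
}
# The response is a four-element array: [query, [titles], [descriptions], [urls]].
query, titles, descriptions, urls = requests.get(API, params=params).json()
print(titles[0])
print(descriptions[0] if descriptions else "(no description returned)")
print(urls[0])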

从﹋此江山别 2024-12-08 10:47:08


To GET the first paragraph of an article:

https://en.wikipedia.org/w/api.php?action=query&titles=Belgrade&prop=extracts&format=json&exintro=1

I have created short Wikipedia API docs (https://github.com/mudroljub/wikipedia-api-docs) for my own needs. It contains working examples of how to get articles, images, and similar content.

知足的幸福 2024-12-08 10:47:08


If you need to do this for a large number of articles, then instead of querying the website directly, consider downloading a Wikipedia database dump and then accessing it through an API such as JWPL.

困倦 2024-12-08 10:47:08


You can get the introduction of an article on Wikipedia by querying a URL such as https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=java. You just need to parse the JSON response; the result is plain text that has already been cleaned, including removing links and references.
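
As a concrete sketch (requests library; the title "java" is the one from the URL above), the plain text sits under query.pages.<pageid>.extract in the JSON:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": 1,
    "explaintext": 1,
    "titles": "java",   # example title from the query above
}
pages = requests.get(API, params=params).json()["query"]["pages"]
intro = next(iter(pages.values()))["extract"]   # page id key is not known in advance
print(intro)  # plain text with links and references already removed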

如此安好 2024-12-08 10:47:08
<script>
    // Fetch the plain-text intro of a Wikipedia article and show it in #Label11.
    function dowiki(place) {
        var URL = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=';

        URL += "&titles=" + encodeURIComponent(place); // article title
        URL += "&rvprop=content";                      // leftover parameter; not used by prop=extracts
        URL += "&callback=?";                          // "?" makes jQuery issue a JSONP request (cross-origin)
        $.getJSON(URL, function (data) {
            var obj = data.query.pages;          // keyed by page id, which is not known in advance
            var ob = Object.keys(obj)[0];        // take the first (and only) page
            console.log(obj[ob]["extract"]);     // plain-text introduction
            try {
                document.getElementById('Label11').textContent = obj[ob]["extract"];
            }
            catch (err) {
                document.getElementById('Label11').textContent = err.message;
            }
        });
    }
</script>
清醇 2024-12-08 10:47:08


You can use the extract_html field of the summary REST endpoint for this: e.g. https://en.wikipedia.org/api/rest_v1/page/summary/Cat.

Note: This aims to simplify the content a bit by removing most of the pronunciations, which mostly appear in parentheses.
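
For example, a minimal Python sketch (requests library; the title "Cat" is the one from the URL above) that reads both the HTML and the plain-text fields of the summary:

import requests

title = "Cat"   # example title from the URL above
url = "https://en.wikipedia.org/api/rest_v1/page/summary/" + title
summary = requests.get(url).json()

print(summary["extract_html"])  # intro as simplified HTML
print(summary["extract"])       # the same intro as plain text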

祁梦 2024-12-08 10:47:08


You can download the Wikipedia database directly and parse all pages to XML with Wiki Parser, which is a standalone application. The first paragraph is a separate node in the resulting XML.

Alternatively, you can extract the first paragraph from its plain-text output.

飘逸的'云 2024-12-08 10:47:08


You can use jQuery to do that. First, create the URL with the appropriate parameters. Check this link to understand what the parameters mean. Then use the $.ajax() method to retrieve the articles. Note that Wikipedia does not allow cross-origin requests; that's why we are using dataType: 'jsonp' in the request.

// Build the query string; action=opensearch returns [query, [titles], [descriptions], [urls]].
var wikiURL = "https://en.wikipedia.org/w/api.php";
wikiURL += '?' + $.param({
    'action' : 'opensearch',
    'search' : 'your_search_term',   // term to look up
    'prop'   : 'revisions',          // note: prop/rvprop are not used by action=opensearch
    'rvprop' : 'content',
    'format' : 'json',
    'limit'  : 10                    // maximum number of results
});

// dataType: 'jsonp' works around the cross-origin restriction mentioned above.
$.ajax({
    url: wikiURL,
    dataType: 'jsonp',
    success: function(data) {
        console.log(data);           // the opensearch result array
    }
});
撩发小公举 2024-12-08 10:47:08


Here is a program that will dump the French and English Wiktionary and Wikipedia:

import sys
import asyncio
import urllib.parse
from uuid import uuid4

import httpx
import found
from found import nstore
from found import bstore
from loguru import logger as log

try:
    import ujson as json
except ImportError:
    import json


# XXX: https://github.com/Delgan/loguru
log.debug("That's it, beautiful and simple logging!")


async def get(http, url, params=None):
    response = await http.get(url, params=params)
    if response.status_code == 200:
        return response.content

    log.error("http get failed with url and reponse: {} {}", url, response)
    return None



def make_timestamper():
    import time
    start_monotonic = time.monotonic()
    start = time.time()
    loop = asyncio.get_event_loop()

    def timestamp():
        # Wanna be faster than datetime.now().timestamp()
        # approximation of current epoch time.
        out = start + loop.time() - start_monotonic
        out = int(out)
        return out

    return timestamp


async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue


async def wikimedia_html(http, wiki="https://en.wikipedia.org/", title="Apple"):
    # e.g. https://en.wikipedia.org/api/rest_v1/page/html/Apple
    url = "{}/api/rest_v1/page/html/{}".format(wiki, urllib.parse.quote(title))
    out = await get(http, url)
    return wiki, title, out


async def save(tx, data, blob, doc):
    uid = uuid4()
    doc['html'] = await bstore.get_or_create(tx, blob, doc['html'])

    for key, value in doc.items():
        nstore.add(tx, data, uid, key, value)

    return uid


WIKIS = (
    "https://en.wikipedia.org/",
    "https://fr.wikipedia.org/",
    "https://en.wiktionary.org/",
    "https://fr.wiktionary.org/",
)

async def chunks(iterable, size):
    # chunk async generator https://stackoverflow.com/a/22045226
    while True:
        out = list()
        for _ in range(size):
            try:
                item = await iterable.__anext__()
            except StopAsyncIteration:
                yield out
                return
            else:
                out.append(item)
        yield out


async def main():
    # logging
    log.remove()
    log.add(sys.stderr, enqueue=True)

    # singleton
    timestamper = make_timestamper()
    database = await found.open()
    data = nstore.make('data', ('sourcery-data',), 3)
    blob = bstore.make('blob', ('sourcery-blob',))

    async with httpx.AsyncClient() as http:
        for wiki in WIKIS:
            log.info('Getting started with wiki at {}', wiki)
            # Polite limit @ https://en.wikipedia.org/api/rest_v1/
            async for chunk in chunks(wikimedia_titles(http, wiki), 200):
                log.info('iterate')
                coroutines = (wikimedia_html(http, wiki, title) for title in chunk)
                items = await asyncio.gather(*coroutines, return_exceptions=True)
                for item in items:
                    if isinstance(item, Exception):
                        msg = "Failed to fetch html on `{}` with `{}`"
                        log.error(msg, wiki, item)
                        continue
                    wiki, title, html = item
                    if html is None:
                        continue
                    log.debug(
                        "Fetch `{}` at `{}` with length {}",
                        title,
                        wiki,
                        len(html)
                    )

                    doc = dict(
                        wiki=wiki,
                        title=title,
                        html=html,
                        timestamp=timestamper(),
                    )

                    await found.transactional(database, save, data, blob, doc)


if __name__ == "__main__":
    asyncio.run(main())

Another approach to acquiring Wikimedia data is to rely on Kiwix ZIM dumps.

面如桃花 2024-12-08 10:47:08


Suppose keyword = "Batman" (the term you want to search for). Then use:

https://en.wikipedia.org/w/api.php?action=parse&page={{keyword}}&format=json&prop=text&section=0

to get the summary/first paragraph from Wikipedia in JSON format.
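
A short Python sketch of that call (requests library; "Batman" is the example keyword), which returns the rendered HTML of section 0 under parse.text:

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "parse",
    "page": "Batman",   # the keyword you want to search for
    "prop": "text",
    "section": 0,
    "format": "json",
}
data = requests.get(API, params=params).json()
html = data["parse"]["text"]["*"]   # HTML of the lead section
print(html)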
