Is there a Wikipedia API dedicated to retrieving content summaries?

Posted 2024-12-21 22:48:13

I need just to retrieve the first paragraph of a Wikipedia page.

Content must be HTML formatted, ready to be displayed on my website (so no BBCode, or Wikipedia special code!)

怀里藏娇 2024-12-28 22:48:13

There's a way to get the entire "introduction section" without any HTML parsing! Similar to AnthonyS's answer with an additional explaintext parameter, you can get the introduction section text in plain text.

Query

Getting Stack Overflow's introduction in plain text:

Using the page title:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or use pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON Response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

Documentation: API: query/prop=extracts
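The query above can be wrapped in a short helper. A minimal sketch (the function names are illustrative, not part of any library; it assumes a JSON-decoded response of the shape shown above):

```javascript
// Build the extracts query URL and pull the intro text out of the
// decoded JSON response. The pages object is keyed by page ID, so we
// take its first value.
function buildExtractUrl(title) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "",     // intro section only
    explaintext: "", // plain text instead of HTML
    redirects: "1",
    titles: title,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

function firstExtract(response) {
  const pages = response.query.pages;
  return Object.values(pages)[0].extract;
}
```

Fetch `buildExtractUrl("Stack Overflow")`, decode the JSON, and pass it to `firstExtract` to get the plain-text intro.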

百思不得你姐 2024-12-28 22:48:13

There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose.

Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (no additional assets like images or infoboxes). You can also retrieve extracts with finer granularity such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow
and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Please note that, if you want the first paragraph specifically, you still need to do some additional parsing as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other API queries suggested, because you don't have additional assets such as images in the API response to parse.

Caveat from the docs:

We do not recommend the usage of exsentences. It does not work for HTML extracts and there are many edge cases for which it doesn't exist. For example "Arm. gen. Ing. John Smith was a soldier." will be treated as 4 sentences. We do not plan to fix this.
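For example, exchars can cap the extract length without running into the exsentences caveat above. A sketch (the helper name is illustrative):

```javascript
// Request a plain-text extract truncated to roughly `chars` characters
// via the exchars parameter.
function buildCharLimitedUrl(title, chars) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exchars: String(chars), // truncate the extract
    explaintext: "",
    titles: title,
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```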

在巴黎塔顶看东京樱花 2024-12-28 22:48:13

Since 2017 Wikipedia provides a REST API with better caching. In the documentation you can find the following API which perfectly fits your use case (as it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow
returns the following data which can be used to display a summary with a small thumbnail:

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "<span class=\"mw-page-title-main\">Stack Overflow</span>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "<span class=\"mw-page-title-main\">Stack Overflow</span>"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/StackOverflow.com_Top_Questions_Page_Screenshot.png/320px-StackOverflow.com_Top_Questions_Page_Screenshot.png",
    "width": 320,
    "height": 144
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/a/a5/StackOverflow.com_Top_Questions_Page_Screenshot.png",
    "width": 1920,
    "height": 865
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "1136271608",
  "tid": "a5580980-9fe9-11ed-8bcd-ff7b011c142c",
  "timestamp": "2023-01-29T15:28:54Z",
  "description": "Website hosting questions and answers on a wide range of topics in computer programming",
  "description_source": "local",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "extract": "Stack Overflow is a question and answer website for professional and enthusiast programmers. It is the flagship site of the Stack Exchange Network. It was created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer websites such as Experts-Exchange. Stack Overflow was sold to Prosus, a Netherlands-based consumer internet conglomerate, on 2 June 2021 for $1.8 billion.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer website for professional and enthusiast programmers. It is the flagship site of the Stack Exchange Network. It was created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer websites such as Experts-Exchange. Stack Overflow was sold to Prosus, a Netherlands-based consumer internet conglomerate, on 2 June 2021 for $1.8 billion.</p>"
}

By default, it follows redirects (so that /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain you can set the CORS header with &origin= (e.g., &origin=*).

As of 2019: The API seems to return more useful information about the page.
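A minimal client-side sketch for this endpoint (the helper names are illustrative; fetch assumes a browser or Node 18+):

```javascript
// URL for the REST summary endpoint shown above.
function summaryUrl(title) {
  return "https://en.wikipedia.org/api/rest_v1/page/summary/" +
    encodeURIComponent(title);
}

// extract_html is the ready-to-display first paragraph from the
// response above; extract is the plain-text variant.
function pickSummaryHtml(data) {
  return data.extract_html;
}

async function fetchSummaryHtml(title) {
  const res = await fetch(summaryUrl(title));
  if (!res.ok) throw new Error("HTTP " + res.status);
  return pickSummaryHtml(await res.json());
}
```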

卖梦商人 2024-12-28 22:48:13

This code allows you to retrieve the content of the first paragraph of the page in plain text.

Parts of this answer come from here and thus here. See MediaWiki API documentation for more information.

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in JSON format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // Get the main text content of the query (it's parsed HTML)

// Pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // Content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}

橘寄 2024-12-28 22:48:13

Yes, there is. For example, if you wanted to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

The parts mean this:

  • format=xml: Return the result formatted as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.

  • action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.

  • titles=Stack%20Overflow: Get information about the page Stack Overflow. It's possible to get the text of more pages in one go, if you separate their names by |.

  • rvprop=content: Return the content (or text) of the revision.

  • rvsection=0: Return only content from section 0.

  • rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section including things like hatnotes (“For other uses …”), infoboxes or images.

There are several libraries available for various languages that make working with the API easier; it may be better for you to use one of them.
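Assembled into a URL, the parameters above look like this (a sketch; the helper name is illustrative):

```javascript
// Build the revisions query from the parameters explained above:
// content of the latest revision, section 0 only, parsed as HTML.
function buildRevisionsUrl(title) {
  const params = new URLSearchParams({
    format: "xml",
    action: "query",
    prop: "revisions",
    titles: title,
    rvprop: "content",
    rvsection: "0",
    rvparse: "", // flag parameter; an empty value is enough
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```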

空‖城人不在 2024-12-28 22:48:13

This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs, summary, and section 0 of off Wikipedia articles, and it's all done within the browser (client-side JavaScript) thanks to the magic of JSONP! --> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) in HTML like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other undesired data, giving you a clean string of the article summary. With a little tweaking you can keep a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code:

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

// Get leading paragraphs (section 0)
$.getJSON("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (text in data.parse.text) {
        var text = data.parse.text[text].split("<p>");
        var pText = "";

        for (p in text) {
            // Remove HTML comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            // Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') // Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); // Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); // Remove reference tags (e.g. [1], [4], etc.)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});

又爬满兰若 2024-12-28 22:48:13

This URL will return the summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I have created a function to fetch description of a keyword from Wikipedia.

function getDescription($keyword) {
    $url = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=' . urlencode($keyword) . '&MaxHits=1';
    $xml = simplexml_load_file($url);
    return $xml->Result->Description;
}

echo getDescription('agra');

浊酒尽余欢 2024-12-28 22:48:13

You can also get content such as the first paragraph via DBPedia which takes Wikipedia content and creates structured information from it (RDF) and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and it is pretty easy to wrap.

As an example here's a super simple JavaScript library named WikipediaJS that can extract structured content including a summary first paragraph.

You can read more about it in this blog post: WikipediaJS - accessing Wikipedia article data through Javascript

The JavaScript library code can be found in wikipedia.js.

你列表最软的妹 2024-12-28 22:48:13

The abstract.xml.gz dump sounds like the one you want.

阪姬 2024-12-28 22:48:13

If you are just looking for the text, which you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw.

寒尘 2024-12-28 22:48:13

My approach was as follows (in PHP):

$url = "whatever_you_need";

$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search='.$url);
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');

$utf8html might need further cleaning, but that's basically it.

愿与i 2024-12-28 22:48:13

I tried Michael Rapadas' and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. Like here:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

Note that I truncated the response with exsentences=1.

Apparently "title normalization" was not working correctly:

Title normalization converts page titles to their canonical form. This means capitalizing the first character, replacing underscores with spaces, and changing namespace to the localized form defined for that wiki. Title normalization is done automatically, regardless of which query modules are used. However, any trailing line breaks in page titles (\n) will cause odd behavior and they should be stripped out first.

I know I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

Because I just really wanted the very first paragraph of a well-known and defined search (no risk of fetching info from other articles), I did it like this:

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

Note that in this case I did the truncation with limit=1.

This way:

  1. I can access the response data very easily.
  2. The response is quite small.

But we still have to be careful with the capitalization of our search.

More information: https://www.mediawiki.org/wiki/API:Opensearch
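The opensearch response is a four-element array: the search term, then parallel arrays of matching titles, descriptions, and URLs. A sketch of reading the first hit (the sample data in the test is illustrative):

```javascript
// Opensearch returns [searchTerm, titles, descriptions, urls].
// Pull the first entry of each parallel array into one object.
function firstHit(response) {
  const [, titles, descriptions, urls] = response;
  return {
    title: titles[0],
    description: descriptions[0],
    url: urls[0],
  };
}
```

This is why the response is so easy to access: no page-ID keys to iterate over, just fixed array positions.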

窗影残 2024-12-28 22:48:13

There's a simpler way now with Wikimedia Enterprise, which provides an abstract field (https://enterprise.wikimedia.com/docs/data-dictionary/#abstract) in the v2/articles endpoint (https://enterprise.wikimedia.com/docs/on-demand/).
