Wikipedia API - Get the "Background information" table?

Posted 2024-11-05 14:11:26

Does MediaWiki provide a way to return the information present in the "Background information" table (usually on the right side of the article page)? For example, I would like to grab the Origin from Radiohead:

http://en.wikipedia.org/wiki/Radiohead

Or do I need to parse the HTML page?

雨的味道风的声音 2024-11-12 14:11:26

You can use the revisions property along with the rvgeneratexml parameter to generate a parse tree for the article. Then you can apply XPath or traverse it and look for the desired information.

Here's some example code:

// Build the API request for the article's latest revision and its parse tree.
$page = 'Radiohead';
$api_call_url = 'https://en.wikipedia.org/w/api.php?action=query&titles=' .
    urlencode( $page ) . '&prop=revisions&rvprop=content&rvgeneratexml=1&format=json';

You have to identify yourself to the API with a User-Agent header; see Meta Wiki for more.

$user_agent = 'Your name <your email>';

$curl = curl_init();
curl_setopt_array( $curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT => $user_agent,
    CURLOPT_URL => $api_call_url,
) );
$response = json_decode( curl_exec( $curl ), true );
curl_close( $curl );

foreach( $response['query']['pages'] as $page ) {
    $parsetree = simplexml_load_string( $page['revisions'][0]['parsetree'] );

Here we use XPath to find the Origin parameter of the Infobox musical artist template and its value. See the XPath specification for the syntax. You could also traverse the tree and look for the nodes manually. Feel free to investigate the parse tree to get a better grasp of it.

    $infobox_origin = $parsetree->xpath( '//template[contains(string(title),' .
        '"Infobox musical artist")]/part[contains(string(name),"Origin")]/value' );

    echo trim( strval( $infobox_origin[0] ) );
}
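
The lookup above can also be tried offline. The following is a minimal sketch, assuming a hand-written, simplified parse tree (the `$sample` string is illustrative, not real API output) and a hypothetical helper named `infobox_param`. It walks the tree rather than building one XPath string, and it compares parameter names case-insensitively, on the assumption that the article may write `origin` rather than `Origin`.

```php
<?php
// Hand-written, simplified stand-in for a parse tree (an assumption,
// not real API output).
$sample = <<<'XML'
<root><template><title>Infobox musical artist
</title><part><name> name </name>=<value> Radiohead
</value></part><part><name> origin </name>=<value> Abingdon, Oxfordshire, England
</value></part></template></root>
XML;

// Hypothetical helper: return the value of $param in the first template
// whose title contains $template, or null if nothing matches.
function infobox_param( SimpleXMLElement $tree, string $template, string $param ): ?string {
    foreach ( $tree->xpath( '//template' ) as $tpl ) {
        if ( stripos( (string) $tpl->title, $template ) === false ) {
            continue; // not the template we are looking for
        }
        foreach ( $tpl->part as $part ) {
            // Names may carry padding and differ in case, so normalize both.
            if ( strcasecmp( trim( (string) $part->name ), $param ) === 0 ) {
                return trim( (string) $part->value );
            }
        }
    }
    return null;
}

$tree = simplexml_load_string( $sample );
echo infobox_param( $tree, 'Infobox musical artist', 'Origin' ), "\n";
```

Returning null instead of indexing `[0]` directly avoids a notice when the template or parameter is missing from the page.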

扎心 2024-11-12 14:11:26

MediaWiki as installed on Wikipedia provides no way to get this information (there are extensions such as Semantic MediaWiki that are designed for this sort of thing, but they are not installed on Wikipedia). You can either parse the output HTML or parse the page's wikitext, or in certain cases (e.g. birth/death year) you might be able to look at the page's categories via the API.
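
As a rough illustration of the "parse the page's wikitext" option, here is a naive sketch. It assumes the infobox puts each parameter on its own `| name = value` line, and it runs on a hand-written sample string rather than a real page. The helper name `wikitext_param` is hypothetical. A regex like this will miss values that span several lines or sit inside nested templates.

```php
<?php
// Hypothetical helper: find a "| param = value" line in raw wikitext.
// Naive by design: single-line values only, no nested-template handling.
function wikitext_param( string $wikitext, string $param ): ?string {
    $pattern = '/^[ \t]*\|[ \t]*' . preg_quote( $param, '/' ) . '[ \t]*=[ \t]*(.*)$/mi';
    if ( preg_match( $pattern, $wikitext, $matches ) ) {
        return trim( $matches[1] );
    }
    return null;
}

// Hand-written sample (an assumption, not the article's actual wikitext).
$sample = <<<'WIKI'
{{Infobox musical artist
| name   = Radiohead
| origin = Abingdon, Oxfordshire, England
}}
WIKI;

echo wikitext_param( $sample, 'Origin' ), "\n";
```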

作死小能手 2024-11-12 14:11:26

It's a steep learning curve but DBpedia does what you want.

The "Background information table" you mention is called an "Infobox" in Wikipedia parlance and DBpedia allows very powerful queries on them. Unfortunately because it's powerful it's not easy to learn and I've mostly forgotten what I learned about it a year or two ago. I'll paste a query here though if I manage to learn it again (-:

In the meantime, here is DBpedia's idea of an introduction to using it.

This previous SO question will help: Getting DBPedia Infobox categories

UPDATE

OK here is the SPARQL query:

PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?org
WHERE {
    <http://dbpedia.org/resource/Radiohead> dbpprop:origin ?org
}

Here is a URL where you can see it working and play with it.

And here is the output on that page: (you can get output in various formats too)

SPARQL results:
org
"Abingdon, Oxfordshire, England"@en
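
To run a query like this from code, one option is to build a request URL for DBpedia's public SPARQL endpoint. The sketch below only builds the URL and makes no network call. It assumes the endpoint lives at `https://dbpedia.org/sparql` and accepts `query` and `format` parameters; the helper name `dbpedia_query_url` is hypothetical.

```php
<?php
// Hypothetical helper: build a URL that asks DBpedia for one infobox property.
// The endpoint address and the "format" parameter are assumptions here.
function dbpedia_query_url( string $resource, string $property ): string {
    $sparql = "PREFIX dbpprop: <http://dbpedia.org/property/>\n" .
        "SELECT ?org WHERE { <http://dbpedia.org/resource/$resource> " .
        "dbpprop:$property ?org }";

    // http_build_query() takes care of URL-encoding the query text.
    return 'https://dbpedia.org/sparql?' . http_build_query( array(
        'query'  => $sparql,
        'format' => 'application/sparql-results+json',
    ) );
}

$url = dbpedia_query_url( 'Radiohead', 'origin' );
echo $url, "\n";
```

The resulting URL can be fetched with the same cURL setup used for the MediaWiki API earlier in the thread.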
