
Getting an excerpt from a Wikipedia article?

Posted 2024-08-27 12:44:14

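I have been up and down the Wikipedia API, but I cannot figure out whether there is a good way to get an excerpt of an article (usually the first paragraph). Getting that paragraph as HTML would be ideal.

The only way I currently see to get something resembling a snippet is to perform a full-text search (example), but that is not really what I want (too short).

Is there any way to get the first paragraph of a Wikipedia article other than crudely parsing the HTML/WikiText?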


谜兔 2024-09-03 12:44:14

Use this link to get the unparsed intro in XML form:
http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja

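Earlier I could pull the intros for a list of topics, or for the articles in a category, onto a single page by adding iframes whose src was a link like the one above. Now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there a way around this? Please help.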
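One possible way around the iframe restriction is to request the same URL server-side and embed the result yourself, since X-Frame-Options only affects framing in the browser. The sketch below is an illustration, not part of the original post: `buildExtractUrl` and `fetchUrl` are hypothetical helper names, the User-Agent string is a placeholder, and `fetchUrl` assumes `allow_url_fopen` is enabled. It also encodes the space in the title, which the raw link above leaves unescaped.

```php
<?php
// Build the prop=extracts query URL from the post above.
function buildExtractUrl(string $title, int $sentences = 10): string
{
    $params = [
        'format'      => 'xml',
        'action'      => 'query',
        'prop'        => 'extracts',
        'exsentences' => $sentences,
        'titles'      => $title,
    ];
    // RFC 3986 encoding turns the space in the title into %20.
    return 'https://en.wikipedia.org/w/api.php?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Fetch a URL server-side (hypothetical helper; assumes allow_url_fopen is on).
function fetchUrl(string $url): string
{
    $context = stream_context_create([
        // Placeholder User-Agent; replace with one identifying your own script.
        'http' => ['header' => "User-Agent: ExcerptDemo/0.1 (example script)\r\n"],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Request failed: $url");
    }
    return $body;
}

echo buildExtractUrl('Aati kalenja'), "\n";
```

Calling `fetchUrl(buildExtractUrl('Aati kalenja'))` would then return the XML response as a string, which can be parsed and rendered into the page without an iframe.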


相权↑美人 2024-09-03 12:44:14

I found no way of doing this through the API, so I resorted to parsing the HTML using PHP's DOM functions. This was pretty easy, something along the lines of:

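// $wikiPage is assumed to hold the article's HTML source
libxml_use_internal_errors(true); // suppress warnings about markup libxml cannot parse
$doc = new DOMDocument();
$doc->loadHTML($wikiPage);
$xpath = new DOMXPath($doc);
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
if ($nFirstP !== null) { // item(0) is null when no paragraph matched
    $sFirstP = $doc->saveXML($nFirstP);
    echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>
}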



简单爱 2024-09-03 12:44:14

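As ARAVIND VR pointed out, on wikis running the MobileFrontend extension (including Wikipedia), you can easily get article excerpts through the MediaWiki API with the prop=extracts API query (documented at www.mediawiki.org/wiki/Extension%3aMobileFrontend#prop.3Dextracts).

For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia, in a JSON wrapper.

Various options of the query control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, optionally restricted to the article's intro section), and how section headings are formatted in the output. It is also possible to fetch intro excerpts from multiple articles in a single query.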
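As a rough illustration of those options, the sketch below builds a prop=extracts query for several titles and reads the excerpts out of a JSON response. The parameter names (`exintro`, `explaintext`, `exsentences`, `exlimit`) are taken from the extracts API documentation rather than from this answer. The function names `buildIntroQuery` and `extractsFromResponse` are hypothetical, and the sample response is made up to show the expected shape only.

```php
<?php
// Build a prop=extracts query for one or more titles (hypothetical helper).
function buildIntroQuery(array $titles, bool $plainText, int $sentences): string
{
    $params = [
        'action'      => 'query',
        'format'      => 'json',
        'prop'        => 'extracts',
        'exintro'     => 1,            // restrict to the intro section
        'exsentences' => $sentences,   // maximum length in sentences
        'exlimit'     => 'max',        // allow one extract per requested title
        'titles'      => implode('|', $titles),
    ];
    if ($plainText) {
        $params['explaintext'] = 1;    // plain text instead of HTML
    }
    return 'https://en.wikipedia.org/w/api.php?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Map page title => extract from a decoded API response.
function extractsFromResponse(string $json): array
{
    $data = json_decode($json, true);
    $result = [];
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (isset($page['extract'])) {
            $result[$page['title']] = $page['extract'];
        }
    }
    return $result;
}

// Made-up sample response, for illustration only.
$sample = '{"query":{"pages":{"1":{"pageid":1,"ns":0,"title":"Example",'
    . '"extract":"<p>Placeholder intro.</p>"}}}}';

print_r(extractsFromResponse($sample));
echo buildIntroQuery(['Stack Overflow', 'PHP'], true, 2), "\n";
```

Joining titles with `|` keeps everything in one request; pages without an extract are simply skipped by the parser.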


恰似旧人归 2024-09-03 12:44:14

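It is possible to get only the "introduction" of the article using the API, with the parameter rvsection=0, as explained here.

Converting wikitext to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing: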

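// $c is assumed to hold the raw wikitext; $html is a boolean selecting HTML or plain-text output
// remove templates (even nested)
do {
    $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments (non-greedy match, so each comment is removed separately)
$c = preg_replace('/<!--.*?-->\n?/s', '', $c);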
// remove links
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
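
The rvsection=0 request mentioned at the start of this answer could be sketched as follows, to show where the wikitext in `$c` might come from. This is an assumption-laden illustration: `buildLeadSectionUrl` and `leadWikitext` are hypothetical names, the sample response is made up, and the parser accepts both the slot-based response shape (requested here with `rvslots=main`) and the older shape where the content sits directly under `*`.

```php
<?php
// Build a revisions query that returns only section 0 (the lead) as wikitext.
function buildLeadSectionUrl(string $title): string
{
    $params = [
        'action'    => 'query',
        'format'    => 'json',
        'prop'      => 'revisions',
        'rvprop'    => 'content',
        'rvslots'   => 'main',
        'rvsection' => 0,
        'titles'    => $title,
    ];
    return 'https://en.wikipedia.org/w/api.php?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Pull the wikitext of the first page's latest revision out of the response.
function leadWikitext(string $json): ?string
{
    $data = json_decode($json, true);
    foreach ($data['query']['pages'] ?? [] as $page) {
        $rev = $page['revisions'][0] ?? null;
        if ($rev === null) {
            continue; // page missing or without revisions
        }
        // Slot-based shape first, then the legacy shape.
        return $rev['slots']['main']['*'] ?? $rev['*'] ?? null;
    }
    return null;
}

// Made-up sample response, for illustration only.
$sample = <<<'JSON'
{"query":{"pages":{"1":{"title":"Example","revisions":[{"slots":{"main":{"*":"'''Example''' is a [[placeholder]]."}}}]}}}}
JSON;

echo buildLeadSectionUrl('Stack Overflow'), "\n";
echo leadWikitext($sample), "\n";
```

The string returned by `leadWikitext` is what the cleanup code above would receive as `$c`.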

