Fetch excerpt from Wikipedia article?

Posted on 2024-08-27 12:44:14

I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would be nice to get the HTML formatting of that paragraph, too.

The only way I currently see of getting something that resembles a snippet is by performing a fulltext search (example), but that's not really what I want (too short).

Is there any other way to fetch the first paragraph of a Wikipedia article than barbarically parsing HTML/WikiText?

Comments (4)

谜兔 2024-09-03 12:44:14

Use this link to get the unparsed intro in XML form:
"http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"

Earlier, I could show the introductions of a list of topics or articles from a category on a single page by adding iframes whose src was a link like the one above. Now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there any way around it? Please help.
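One possible way around the iframe restriction is to fetch the extract from your own code and embed the returned text, rather than framing the API URL. The sketch below is only an illustration in Python: it builds the same query with the title URL-encoded (the space in "Aati kalenja" becomes "+") and reads the extract out of an XML response. The https endpoint, the function names and the sample XML are assumptions made for the example; the sample is hand-written in the shape the API is expected to return, not captured output.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: the https form of the endpoint used in the link above.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=10):
    """Build a prop=extracts query URL with the title safely URL-encoded."""
    params = {
        "format": "xml",
        "action": "query",
        "prop": "extracts",
        "exsentences": sentences,
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extracts_from_xml(xml_text):
    """Return {page title: extract} for every <page> that carries an <extract>."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("page"):
        extract = page.find("extract")
        if extract is not None:
            result[page.get("title")] = extract.text or ""
    return result


# Hand-written sample in the assumed response shape, not real API output.
SAMPLE_XML = (
    '<api><query><pages>'
    '<page pageid="1" ns="0" title="Aati kalenja">'
    '<extract xml:space="preserve">&lt;p&gt;Sample intro.&lt;/p&gt;</extract>'
    '</page></pages></query></api>'
)

url = build_extract_url("Aati kalenja")
extracts = extracts_from_xml(SAMPLE_XML)
print(url)
print(extracts)
```

To use it against the live API, the URL could be fetched with urllib.request.urlopen and the response body passed to extracts_from_xml.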

相权↑美人 2024-09-03 12:44:14

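I found no way of doing this through the API, so I resorted to parsing the HTML with PHP's DOM functions. This was pretty easy, something along the lines of:

// $wikiPage holds the HTML source of the article page
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parse warnings quiet
$doc->loadHTML($wikiPage);
$xpath = new DOMXPath($doc);
// <p> elements that are direct children of the body content container
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>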
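For comparison, here is a rough Python sketch of the same idea using only the standard library's html.parser. It is not a port of the PHP above: it returns the paragraph's text without the <p> tags, it accepts paragraphs nested anywhere inside the container rather than only direct children, and it skips paragraphs that contain only whitespace. The class name and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> inside a container <div>."""

    def __init__(self, container_id="bodyContent"):
        super().__init__()
        self.container_id = container_id
        self.depth = 0        # <div> nesting depth inside the container (0 = outside)
        self.in_p = False     # True while inside a candidate <p>
        self.parts = []       # text chunks of the current <p>
        self.result = None    # text of the first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif dict(attrs).get("id") == self.container_id:
                self.depth = 1
        elif tag == "p" and self.depth and not self.in_p:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.parts).strip()
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p and self.result is None:
            self.parts.append(data)


# Hand-written sample page, only loosely modelled on a wiki article.
SAMPLE_HTML = (
    '<div id="siteSub"><p>Not this one.</p></div>'
    '<div id="bodyContent"><div class="inner">'
    '<p class="mw-empty-elt"> </p>'
    '<p><b>Stack Overflow</b> is a Q&amp;A site.</p>'
    '<p>Second paragraph.</p>'
    '</div></div>'
)

parser = FirstParagraph()
parser.feed(SAMPLE_HTML)
parser.close()
first_paragraph = parser.result
print(first_paragraph)
```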

简单爱 2024-09-03 12:44:14

As ARAVIND VR notes, on wikis running the MobileFrontend extension, which includes Wikipedia, you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts API query. (In newer MediaWiki versions this query is provided by the TextExtracts extension.)

For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.

The various options to the query can be used to control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, and optionally restricting it to the intro section of the article) and the formatting of section headings in the output. It's also possible to obtain intro extracts from more than one article in a single query.
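As a rough illustration of those options, the Python sketch below builds one query that asks for the intro extracts of several articles at once and reads them out of a JSON response. The parameter names (exintro, explaintext, exsentences, exlimit) are the ones the extracts query documents; the function names, the https endpoint and the sample response are assumptions for the example, and the sample is hand-written in the default JSON shape (pages keyed by page id) rather than captured from the API.

```python
import json
import urllib.parse

# Assumption: the English Wikipedia endpoint over https.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_query(titles, plain_text=True, sentences=None):
    """Build a prop=extracts query for the intro sections of several pages."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,             # restrict the extract to the intro section
        "exlimit": "max",         # allow one extract per requested title
        "titles": "|".join(titles),
    }
    if plain_text:
        params["explaintext"] = 1  # plain text instead of limited HTML
    if sentences is not None:
        params["exsentences"] = sentences
    return API_URL + "?" + urllib.parse.urlencode(params)


def extracts_from_json(body):
    """Return {page title: extract} from a response in the default JSON shape."""
    pages = json.loads(body)["query"]["pages"]
    return {page["title"]: page.get("extract", "") for page in pages.values()}


# Hand-written sample in the assumed response shape, not real API output.
SAMPLE_JSON = json.dumps({
    "query": {"pages": {
        "1": {"pageid": 1, "title": "Stack Overflow", "extract": "Sample intro A."},
        "2": {"pageid": 2, "title": "PHP", "extract": "Sample intro B."},
    }}
})

query_url = build_intro_query(["Stack Overflow", "PHP"], sentences=2)
intros = extracts_from_json(SAMPLE_JSON)
print(query_url)
print(intros)
```

Passing plain_text=False leaves out explaintext, in which case the extract comes back as limited HTML.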

恰似旧人归 2024-09-03 12:44:14

It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0 as explained here.
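A request for section 0 might look like the Python sketch below. Only rvsection=0 comes from this answer; the other parameters (prop=revisions, rvprop=content, rvslots=main, formatversion=2), the function names and the https endpoint are assumptions for the example, and the sample response is hand-written in the shape that combination is expected to return.

```python
import json
import urllib.parse

# Assumption: the English Wikipedia endpoint over https.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_section0_url(title):
    """Build a query for the wiki text of section 0 (the lead) of one page."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "rvsection": 0,
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def section0_wikitext(body):
    """Pull the wiki text out of a response in the assumed formatversion=2 shape."""
    page = json.loads(body)["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


# Hand-written sample in the assumed response shape, not real API output.
SAMPLE_JSON = json.dumps({
    "query": {"pages": [{
        "pageid": 1,
        "title": "Stack Overflow",
        "revisions": [{"slots": {"main": {
            "contentmodel": "wikitext",
            "content": "'''Stack Overflow''' is a [[website]].",
        }}}],
    }]}
})

section0_url = build_section0_url("Stack Overflow")
wikitext = section0_wikitext(SAMPLE_JSON)
print(section0_url)
print(wikitext)
```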

Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:

// $c holds the wiki text of section 0; $html is true for HTML output, false for plain text
// remove templates (even nested): each pass strips the innermost {{...}}
do {
    $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments
$c = preg_replace('/<!--(?:[^-]|-[^-]|--[^>])+-->\n?/', '', $c);
// remove links, keeping the visible label: [[target|label]] and [http://... label]
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic (bold first, since ''' contains '')
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
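To make the effect of those steps easier to see, here is a rough Python sketch of the same cleanup applied to a small made-up sample. It follows the order of the steps above but is not an exact port: the comment pattern is simplified to a non-greedy match, and the function name and sample text are assumptions for the example. Like the original, it only covers the simplest wiki markup.

```python
import re


def strip_wikitext(text, html=False):
    """Reduce simple wiki text to plain text, or to minimal HTML if html=True."""
    # Remove templates, innermost first, until none are left (handles nesting).
    count = 1
    while count:
        text, count = re.subn(r"\{\{[^{}]+\}\}\n?", "", text)
    # Remove HTML comments (simplified non-greedy pattern).
    text = re.sub(r"<!--.*?-->\n?", "", text, flags=re.DOTALL)
    # Replace [[target|label]] and [[target]] with the visible label.
    text = re.sub(r"\[\[(?:[^\]|]+\|)?([^\]]+)\]\]", r"\1", text)
    # Replace [http://example.org label] with the label.
    text = re.sub(r"\[http[^ ]+ ([^\]]+)\]", r"\1", text)
    # Remove <ref>...</ref> footnotes.
    text = re.sub(r"<ref(?:[^<]|<[^/])+</ref>", "", text)
    text = text.strip()
    # Convert bold (''') before italic (''), since ''' contains ''.
    text = re.sub(r"'''((?:[^']|'[^']|''[^'])+)'''",
                  r"<b>\1</b>" if html else r"\1", text)
    text = re.sub(r"''((?:[^']|'[^'])+)''",
                  r"<i>\1</i>" if html else r"\1", text)
    if html:
        text = text.replace("\n", "<br/>\n")
    return text


# Made-up sample wiki text with a nested template, a link, a footnote and a comment.
SAMPLE = (
    "{{Infobox|name={{lc:X}}}}\n"
    "'''Stack Overflow''' is a [[Q&A|question-and-answer]] ''website''."
    "<ref>Some source</ref><!-- note -->"
)

plain = strip_wikitext(SAMPLE)
marked_up = strip_wikitext(SAMPLE, html=True)
print(plain)
print(marked_up)
```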