Fetch excerpt from Wikipedia article?
I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would be nice to get the HTML formatting of that paragraph, too.
The only way I currently see of getting something that resembles a snippet is by performing a fulltext search (example), but that's not really what I want (too short).
Is there any other way to fetch the first paragraph of a Wikipedia article than barbarically parsing HTML/WikiText?
Comments (4)
Use this link to get the unparsed intro in XML form:
"http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"
Earlier I could show the introductions of a list of topics/articles from a category on a single page by adding iframes with a src like the link above. But now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there any way around this? Please help.
I found no way of doing this through the API, so I resorted to parsing HTML, using PHP's DOM functions. This was pretty easy, something along the lines of:
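The PHP code itself is not reproduced here. As a rough illustration of the same idea, written in Python rather than PHP, the sketch below pulls the text of the first non-empty `<p>` element out of an already-downloaded HTML string. The class and function names (`FirstParagraphParser`, `first_paragraph`) and the sample markup are made up for this example, and it returns plain text only, not the paragraph's HTML formatting.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collects the text of the first non-empty <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False    # True while we are inside a <p>
        self.parts = []      # text chunks of the paragraph being read
        self.result = None   # first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_p = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.parts).strip()
            if text:  # skip empty paragraphs and keep looking
                self.result = text

    def handle_data(self, data):
        if self.in_p and self.result is None:
            self.parts.append(data)


def first_paragraph(html):
    """Return the text of the first non-empty paragraph, or '' if there is none."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.result or ""


# Hand-written sample markup, not real Wikipedia output.
SAMPLE_HTML = (
    "<div><table><tr><td>Infobox</td></tr></table>"
    "<p></p>"
    "<p><b>Stack Overflow</b> is a question-and-answer website.</p>"
    "<p>Second paragraph.</p></div>"
)

if __name__ == "__main__":
    print(first_paragraph(SAMPLE_HTML))
```

Text inside tables and other non-paragraph elements is ignored, which is why the infobox cell in the sample does not show up in the result.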
As ARAVIND VR notes, on wikis running the MobileFrontend extension, which includes Wikipedia, you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts API query. For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.
The various options to the query can be used to control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, and optionally restricting it to the intro section of the article) and the formatting of section headings in the output. It's also possible to obtain intro extracts from more than one article in a single query.
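To make those options concrete, here is a minimal Python sketch that builds a prop=extracts query URL and reads the extracts out of a JSON response. The helper names (`build_extract_url`, `extracts_from_response`) and the sample response are invented for illustration; `exintro`, `explaintext`, and `exsentences` are parameters of the extracts query, but check the API documentation for current limits before relying on multi-title requests.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(titles, sentences=None, intro_only=True, plain_text=False):
    """Build a prop=extracts query URL for one or more article titles."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": "|".join(titles),  # several titles are separated by "|"
    }
    if intro_only:
        params["exintro"] = 1        # only the text before the first section
    if plain_text:
        params["explaintext"] = 1    # plain text instead of limited HTML
    if sentences is not None:
        params["exsentences"] = sentences
    return API_URL + "?" + urlencode(params)


def extracts_from_response(body):
    """Map each page title to its extract, given the JSON response body."""
    pages = json.loads(body)["query"]["pages"]
    return {page["title"]: page.get("extract", "") for page in pages.values()}


# Hand-written response in the shape the JSON format uses; the page id is made up.
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": {"123": {
        "pageid": 123, "ns": 0, "title": "Stack Overflow",
        "extract": "<p><b>Stack Overflow</b> is a question-and-answer website.</p>",
    }}}
})

if __name__ == "__main__":
    print(build_extract_url(["Stack Overflow"], sentences=2))
    print(extracts_from_response(SAMPLE_RESPONSE))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.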
It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0, as explained here. Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:
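The answer's own conversion code is not reproduced here. Purely as a sketch of the approach, in Python, the block below builds a revisions query with rvsection=0 and applies a deliberately crude wikitext-to-HTML conversion. The function names are hypothetical, and the converter only handles bold, italics, simple links, and non-nested templates; real wikitext (references, nested templates, tables) needs a proper parser.

```python
import html
import re
from urllib.parse import urlencode


def build_intro_wikitext_url(title):
    """Build a query URL for the wikitext of section 0 (the introduction)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,  # section 0 is the text before the first heading
        "titles": title,
    }
    # Assumption: newer MediaWiki versions may also expect an rvslots parameter.
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


def rough_wikitext_to_html(wikitext):
    """Very rough conversion covering only a few common constructs."""
    text = html.escape(wikitext, quote=False)              # escape &, <, >
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)             # drop non-nested templates
    text = re.sub(r"\[\[[^\]|]+\|([^\]]+)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"\[\[([^\]]+)\]\]", r"\1", text)        # [[target]] -> target
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)      # '''bold'''
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)        # ''italic''
    return text


if __name__ == "__main__":
    print(build_intro_wikitext_url("Stack Overflow"))
    print(rough_wikitext_to_html(
        "'''Stack Overflow''' is a [[Q and A|question-and-answer]] site for ''programmers''."
    ))
```

Links are reduced to their visible text rather than turned into anchors, which keeps the regular expressions short at the cost of losing the link targets.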