Is there a Wikipedia API just for retrieving content summaries?
I just need to retrieve the first paragraph of a Wikipedia page.
Content must be HTML formatted, ready to be displayed on my website (so no BBCode, or Wikipedia special markup!)
There's a way to get the entire "introduction section" without any HTML parsing! Similar to AnthonyS's answer, but with the additional explaintext parameter, you can get the introduction section text in plain text.

Query

Getting Stack Overflow's introduction in plain text:

Using the page title:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or using pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON response (warnings stripped)

Documentation: API:query/prop=extracts
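As a sketch, the query above can be assembled and its response unwrapped like this (the sample response below is abbreviated for illustration, not a real API capture):

```javascript
// Build the extracts query described above (URL and URLSearchParams are
// built into modern browsers and Node).
function buildExtractUrl(title) {
  const url = new URL("https://en.wikipedia.org/w/api.php");
  url.search = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "",
    explaintext: "",
    redirects: "1",
    titles: title,
  }).toString();
  return url.toString();
}

// The response nests pages under their numeric page ID, so take the first one.
function firstExtract(response) {
  const pages = response.query.pages;
  const page = pages[Object.keys(pages)[0]];
  return page.extract;
}

// Abbreviated, hypothetical response shape:
const sampleExtractResponse = {
  query: {
    pages: {
      "21721040": {
        pageid: 21721040,
        title: "Stack Overflow",
        extract: "Stack Overflow is a question and answer website...",
      },
    },
  },
};
```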
There is actually a very nice prop called extracts that can be used with queries designed specifically for this purpose.
Extracts allow you to get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text in the zeroth section (no additional assets like images or infoboxes). You can also retrieve extracts with finer granularity such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).
Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow
and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.
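The finer-grained truncation parameters mentioned above slot into the same query; a minimal sketch (the parameter values are illustrative):

```javascript
// Sketch: the same extracts query, truncated by sentence count (exsentences)
// or character count (exchars) instead of returning the whole intro.
function buildTruncatedExtractUrl(title, opts) {
  const params = new URLSearchParams({
    action: "query",
    prop: "extracts",
    format: "json",
    explaintext: "",
    titles: title,
  });
  if (opts.sentences) params.set("exsentences", String(opts.sentences));
  else if (opts.chars) params.set("exchars", String(opts.chars));
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```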
Please note that, if you want the first paragraph specifically, you still need to do some additional parsing as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than some of the other API queries suggested, because you don't have additional assets such as images in the API response to parse.
Caveat from the docs:
Since 2017 Wikipedia provides a REST API with better caching. In the documentation you can find the following API which perfectly fits your use case (as it is used by the new Page Previews feature).
https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow
returns the following data which can be used to display a summary with a small thumbnail:
By default, it follows redirects (so that /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain, you can set the CORS header with &origin= (e.g., &origin=*).

As of 2019: the API seems to return more useful information about the page.
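A sketch of picking display fields out of that summary response (the sample object below is abbreviated and hypothetical; real responses carry more fields):

```javascript
// Sketch: reduce a REST page-summary response to what a preview card needs.
function summaryCard(summary) {
  return {
    title: summary.title,
    html: summary.extract_html, // ready-to-embed HTML paragraph
    thumbnail: summary.thumbnail ? summary.thumbnail.source : null,
  };
}

// Abbreviated, hypothetical response for illustration:
const sampleSummary = {
  title: "Stack Overflow",
  extract_html: "<p><b>Stack Overflow</b> is a question and answer website.</p>",
  thumbnail: { source: "https://upload.wikimedia.org/example/logo.png", width: 320, height: 180 },
};
```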
This code allows you to retrieve the content of the first paragraph of the page in plain text.
Parts of this answer come from here and thus here. See MediaWiki API documentation for more information.
Yes, there is. For example, if you wanted to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

The parts mean this:

format=xml: Return the result formatted as XML. Other options (like JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.

action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.

titles=Stack%20Overflow: Get information about the page Stack Overflow. It's possible to get the text of more pages in one go, if you separate their names by |.

rvprop=content: Return the content (or text) of the revision.

rvsection=0: Return only the content of section 0.

rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section, including things like hatnotes ("For other uses …"), infoboxes, and images.

There are several libraries available for various languages that make working with the API easier; it may be better if you used one of them.
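The parameters listed above can be assembled the same way as any other query; a small sketch:

```javascript
// Sketch: build the revisions query for section 0, parsed as HTML,
// exactly as broken down in the parameter list above.
function buildFirstSectionUrl(title) {
  const params = new URLSearchParams({
    format: "xml",
    action: "query",
    prop: "revisions",
    titles: title,
    rvprop: "content",
    rvsection: "0",
    rvparse: "",
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}
```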
This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs, summary, and section 0 of Wikipedia articles, and it's all done within the browser (client-side JavaScript) thanks to the magic of JSONP! --> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) in HTML, like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other undesired data, giving you a clean string of an article summary. If you want, with a little tweaking, you can get a "p" HTML tag around the leading paragraphs, but right now there is just a newline character between them.

Code:
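The fiddle's code isn't reproduced here; a rough sketch of the stripping step it describes (the regexes are illustrative, not a robust HTML parser) could look like:

```javascript
// Sketch: keep only the top-level paragraph text from the parse API's
// section-0 HTML, joining paragraphs with newlines as the answer describes.
function stripToSummary(sectionHtml) {
  const paragraphs = sectionHtml.match(/<p[^>]*>[\s\S]*?<\/p>/g) || [];
  return paragraphs
    .map((p) => p.replace(/<[^>]+>/g, ""))  // drop remaining tags
    .map((p) => p.replace(/\[\d+\]/g, "")) // drop footnote markers like [1]
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .join("\n");
}
```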
This URL will return the summary in XML format.

I have created a function to fetch the description of a keyword from Wikipedia.
You can also get content such as the first paragraph via DBPedia which takes Wikipedia content and creates structured information from it (RDF) and makes this available via an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and it is pretty easy to wrap.
As an example here's a super simple JavaScript library named WikipediaJS that can extract structured content including a summary first paragraph.
You can read more about it in this blog post: WikipediaJS - accessing Wikipedia article data through Javascript
The JavaScript library code can be found in wikipedia.js.
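A sketch of what such a wrapped SPARQL request might look like; the endpoint URL and the dbo:abstract property are assumptions based on DBpedia's published ontology, not taken from the answer above:

```javascript
// Sketch: ask DBpedia's public SPARQL endpoint for an article's English
// abstract, requesting JSON results.
function buildAbstractQuery(resource) {
  const sparql = [
    "PREFIX dbo: <http://dbpedia.org/ontology/>",
    "SELECT ?abstract WHERE {",
    `  <http://dbpedia.org/resource/${resource}> dbo:abstract ?abstract .`,
    '  FILTER (lang(?abstract) = "en")',
    "}",
  ].join("\n");
  return "https://dbpedia.org/sparql?" + new URLSearchParams({
    query: sparql,
    format: "application/sparql-results+json",
  }).toString();
}
```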
The abstract.xml.gz dump sounds like the one you want.
If you are just looking for the text, which you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw.
My approach was as follows (in PHP):

$utf8html might need further cleaning, but that's basically it.
I tried Michael Rapadas' and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. Like here:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

Note I truncated the response with exsentences=1. Apparently "title normalization" was not working correctly:

I know I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

Because I just really wanted the very first paragraph of a well-known and well-defined search (no risk of fetching info from other articles), I did it like this:

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

Note in this case I did the truncation with limit=1.

This way:

But we have to keep being careful with the capitalization of our search.

More information: https://www.mediawiki.org/wiki/API:Opensearch
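The opensearch response is a plain four-element array, which makes it easy to pick apart; a sketch (the sample response below is abbreviated and hypothetical):

```javascript
// Sketch: unwrap an opensearch response of the shape
// [query, [titles], [descriptions], [urls]].
function firstHit(opensearchResponse) {
  const [query, titles, descriptions, urls] = opensearchResponse;
  if (!titles.length) return null;
  return { title: titles[0], description: descriptions[0], url: urls[0] };
}

// Abbreviated, hypothetical response for illustration:
const sampleHit = firstHit([
  "led zeppelin",
  ["Led Zeppelin"],
  ["Led Zeppelin were an English rock band."],
  ["https://en.wikipedia.org/wiki/Led_Zeppelin"],
]);
```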
There's a simpler way now with Wikimedia Enterprise, using the abstract field: https://enterprise.wikimedia.com/docs/data-dictionary/#abstract in the v2/articles endpoint: https://enterprise.wikimedia.com/docs/on-demand/