Getting text content from a MediaWiki page via the API

Posted on 2024-08-09 03:19:30 | Words: 342 | Views: 12 | Comments: 0


I'm quite new to MediaWiki, and I've run into a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the page content with wiki markup. I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test

But I need only the textual content, without the Wiki markup.
Is that possible with the MediaWiki API?


岁吢 2024-08-16 03:19:30

Use action=parse to get the HTML:

/api.php?action=parse&page=test

One way to get the text from the HTML is to load it into a browser and walk the nodes with JavaScript, looking only for text nodes.
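Outside a browser, the same text-node walk can be done with an HTML parser. Here is a minimal sketch in Python using only the standard library's html.parser module. The names TextCollector and html_to_text are illustrative, not part of MediaWiki, and the sample HTML is made up rather than real action=parse output:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML fragment joined by single spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


# Made-up sample standing in for the HTML returned by action=parse.
sample = "<div><p>Hello <b>wiki</b> world</p><style>p{}</style></div>"
print(html_to_text(sample))
```

Joining the pieces with spaces is a simplification: it inserts a space wherever inline markup splits a word, so real pages may need smarter whitespace handling.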


对岸观火 2024-08-16 03:19:30


The TextExtracts extension of the API does about what you're asking. Use prop=extracts to get a cleaned up response. For example, this link will give you cleaned up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify individual sections of the article.

Just to include a visible link in my answer, the above link looks like:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.


独夜无伴 2024-08-16 03:19:30

Append ?action=raw to the end of a MediaWiki page URL to return the latest content in raw text form. For example: https://en.wikipedia.org/wiki/Main_Page?action=raw
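If the page URL already carries a query string, appending ?action=raw by hand produces a malformed URL. A small sketch of a safer way to add the parameter, using Python's standard urllib.parse; raw_url is a hypothetical helper name, and note that the raw output is the page's wikitext source, so the markup is still present:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def raw_url(page_url):
    """Return page_url with action=raw set in its query string."""
    parts = urlsplit(page_url)
    # Keep existing parameters, but drop any previous "action" value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "action"
    ]
    query.append(("action", "raw"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(raw_url("https://en.wikipedia.org/wiki/Main_Page"))
```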


时间海 2024-08-16 03:19:30


You can get the wiki data in text format from the API by using the explaintext parameter. Plus, if you need to access many titles' information, you can get all the titles' wiki data in a single call. Use the pipe character | to separate each title. For example, this API call will return the data from both the "Google" and "Yahoo" pages:

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

Parameters:

explaintext: Return extracts as plain text instead of limited HTML.
exlimit=max: Return more than one result. The max is currently 20.
exintro: Return only the content before the first section. If you want the full data, just remove this.
redirects=: Resolve redirect issues.
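The call above can be sketched in Python as follows. This is only an illustration under some assumptions: build_extracts_url and extracts_by_title are hypothetical helper names, the endpoint is the Wikipedia one from the example (written with https here), the response is assumed to use the default JSON layout with pages keyed by page ID, and the sample response is made up:

```python
from urllib.parse import urlencode

# Endpoint taken from the example above; other wikis use their own api.php path.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extracts_url(titles, intro_only=True):
    """Build a prop=extracts query URL for one or more page titles."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",            # flag parameter: present means enabled
        "titles": "|".join(titles),   # pipe-separated list of titles
        "redirects": "",
    }
    if intro_only:
        params["exintro"] = ""        # only the text before the first section
    return API_URL + "?" + urlencode(params)


def extracts_by_title(response):
    """Map each page title in a decoded JSON response to its extract text."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


# Made-up response in the assumed layout, for illustration only.
sample_response = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "title": "Google", "extract": "Sample text A."},
            "2": {"pageid": 2, "title": "Yahoo", "extract": "Sample text B."},
        }
    }
}

print(build_extracts_url(["Yahoo", "Google"]))
print(extracts_by_title(sample_response))
```

To use it against a live wiki, fetch the built URL with any HTTP client, decode the JSON body, and pass the resulting dictionary to extracts_by_title. Since TextExtracts is an extension, a wiki without it will not return the extract field.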


梦毁影碎の 2024-08-16 03:19:30

This is the simplest way:
http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content


苦妄 2024-08-16 03:19:30


Python users coming to this question might be interested in the wikipedia module (docs):

import wikipedia

# Select the German-language Wikipedia
wikipedia.set_lang('de')
# Look up the page by title and print its plain-text content
page = wikipedia.page('Wikipedia')
print(page.content)

All formatting except section headings (==) is stripped away.


标点 2024-08-16 03:19:30

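I don't think it is possible to get just the text using the API.

What worked for me was to request the HTML page (using the normal URL you would use in a browser) and strip the HTML tags under the content div.

EDIT:

I have had good results using HTML Parser for Java. It includes examples of how to strip the HTML tags under a given DIV.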


冷心人i 2024-08-16 03:19:30

Use action=render to get the page as clean as possible:

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I?action=render

compared with

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I


彩虹直至黑白 2024-08-16 03:19:30

Once you have pulled the content into your page, there is one more thing you can do: use the PHP function strip_tags() to remove the HTML tags.
