Get the content of a Wikipedia article

Posted on 2024-11-18 15:03:30

I want to get the contents of a Wikipedia article using the actual API. I know full well about action=render and action=raw, but I want the most barebones version possible, in plain text: no formatting, no links, and preferably no templates, no citations, and no TOC. To give an example, here's an excerpt from the SO page:

<p><b>Stack Overflow</b> is a <a href="http://en.wikipedia.org/wiki/Website" title="Website">website</a>, part of the <a href="http://en.wikipedia.org/wiki/Stack_Exchange_Network" title="Stack Exchange Network">Stack Exchange Network</a>,<sup id="cite_ref-blog_legal_1-0" class="reference"><a href="#cite_note-blog_legal-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-stackapps_legal_2-0" class="reference"><a href="#cite_note-stackapps_legal-2"><span>[</span>3<span>]</span></a></sup> featuring questions and answers on a wide range of topics in <a href="http://en.wikipedia.org/wiki/Computer_programming" title="Computer programming">computer programming</a>.<sup id="cite_ref-secrets_3-0" class="reference"><a href="#cite_note-secrets-3"><span>[</span>4<span>]</span></a></sup><sup id="cite_ref-slashdot_4-0" class="reference"><a href="#cite_note-slashdot-4"><span>[</span>5<span>]</span></a></sup><sup id="cite_ref-google-tech-talks_5-0" class="reference"><a href="#cite_note-google-tech-talks-5"><span>[</span>6<span>]</span></a></sup></p> 

And this comes only after all the templates and other markup at the top of the page. I want to cut those out completely and find where the real article starts. Then I need to shave this down further to something like:

Stack Overflow is a website, part of
the Stack Exchange Network, featuring
questions and answers on a wide range
of topics in computer programming.

How can I cut through the templating and wiki formatting to get the raw article contents by themselves? This'd be implemented in PHP.

Comments (1)

夏の忆 2024-11-25 15:03:30

The Wikipedia and MediaWiki APIs have everything you are looking for. For the SO example, here is the SO wiki API page.
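As a rough illustration of what such an API request could look like, here is a minimal sketch in Python (the same query string works from PHP or any other HTTP client). It only builds the URL for the MediaWiki `action=parse` module and does not send it. The helper name `build_parse_url` and the `lead_only` flag are made up for this example; `section=0` asks for the lead section only, which is one way to skip most of the page, though templates rendered inside the lead will still be there.

```python
from urllib.parse import urlencode

# English Wikipedia's MediaWiki API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page_title, lead_only=True):
    """Build an action=parse URL that returns a page's rendered HTML as JSON.

    build_parse_url and lead_only are hypothetical names used for illustration.
    """
    params = {
        "action": "parse",   # render the page through the wiki parser
        "page": page_title,  # page title, e.g. "Stack Overflow"
        "prop": "text",      # only return the rendered HTML
        "format": "json",
    }
    if lead_only:
        params["section"] = "0"  # section 0 is the lead, before the first heading
    return API_ENDPOINT + "?" + urlencode(params)


print(build_parse_url("Stack Overflow"))
```

The response still contains HTML like the excerpt in the question, so it needs a tag-stripping pass afterwards.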

I don't think you can get plain text directly through the API, though. You need to choose from this set of parsers, depending on what you are looking for.
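If none of those parsers fits, the rendered HTML can be reduced to text by hand. The following is only a sketch, written in Python with the standard-library `html.parser` rather than any of the parsers referred to above, and the class and function names are invented for the example. It keeps text nodes, drops every tag, and skips anything inside `<sup class="reference">`, which is how the citation markers appear in the question's excerpt. It assumes reasonably well-formed HTML and does nothing about infoboxes, tables, or the TOC.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text nodes, skipping citation markers (<sup class="reference">)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a citation <sup>

    def handle_starttag(self, tag, attrs):
        if tag != "sup":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a reference <sup>; count nested <sup> while skipping.
        if self.skip_depth or "reference" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Shortened version of the excerpt from the question.
sample = (
    '<p><b>Stack Overflow</b> is a <a href="http://en.wikipedia.org/wiki/Website" '
    'title="Website">website</a>, part of the <a '
    'href="http://en.wikipedia.org/wiki/Stack_Exchange_Network" '
    'title="Stack Exchange Network">Stack Exchange Network</a>,'
    '<sup id="cite_ref-blog_legal_1-0" class="reference">'
    '<a href="#cite_note-blog_legal-1"><span>[</span>2<span>]</span></a></sup> '
    'featuring questions and answers on a wide range of topics in <a '
    'href="http://en.wikipedia.org/wiki/Computer_programming" '
    'title="Computer programming">computer programming</a>.</p>'
)

print(html_to_text(sample))
# Stack Overflow is a website, part of the Stack Exchange Network, featuring
# questions and answers on a wide range of topics in computer programming.
```

In PHP the same idea can be approximated by removing the reference `<sup>` elements first and then calling `strip_tags` on what is left.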

Hope this helps!
