Getting the content of a Wikipedia article

Posted on 2024-11-18 15:03:30

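I want to get the contents of a Wikipedia article using the actual API. I know about action=render and action=raw, but I want the most barebones version possible, in plain text: no formatting, no links, and preferably no templates, no citations, and no table of contents. As an example, here is an excerpt from the SO page: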

<p><b>Stack Overflow</b> is a <a href="http://en.wikipedia.org/wiki/Website" title="Website">website</a>, part of the <a href="http://en.wikipedia.org/wiki/Stack_Exchange_Network" title="Stack Exchange Network">Stack Exchange Network</a>,<sup id="cite_ref-blog_legal_1-0" class="reference"><a href="#cite_note-blog_legal-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-stackapps_legal_2-0" class="reference"><a href="#cite_note-stackapps_legal-2"><span>[</span>3<span>]</span></a></sup> featuring questions and answers on a wide range of topics in <a href="http://en.wikipedia.org/wiki/Computer_programming" title="Computer programming">computer programming</a>.<sup id="cite_ref-secrets_3-0" class="reference"><a href="#cite_note-secrets-3"><span>[</span>4<span>]</span></a></sup><sup id="cite_ref-slashdot_4-0" class="reference"><a href="#cite_note-slashdot-4"><span>[</span>5<span>]</span></a></sup><sup id="cite_ref-google-tech-talks_5-0" class="reference"><a href="#cite_note-google-tech-talks-5"><span>[</span>6<span>]</span></a></sup></p>

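This is what remains even after all the templates and other markup. I want to cut those out completely and find where the real article starts. Then I need to shave this down further to something like: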

Stack Overflow is a website, part of the Stack Exchange Network, featuring questions and answers on a wide range of topics in computer programming.

How do I get past the templates and wiki formatting to the raw article content? This will be implemented in PHP.
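
To make the target concrete, here is a rough sketch of the reduction described above, from rendered HTML down to a bare sentence. It is written in Python with only the standard library, so it would have to be ported to PHP for the stated use. The names `PlainTextExtractor` and `html_to_plain_text` are made up for this illustration, and the sketch assumes that citation markers always appear as `<sup class="reference">` elements, as they do in the excerpt. It does not deal with templates, infoboxes, or the table of contents.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text content, skipping <sup class="reference"> citation markers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a citation <sup>

    def handle_starttag(self, tag, attrs):
        if tag != "sup":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a citation marker; count nested <sup> while skipping.
        if self.skip_depth > 0 or "reference" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html):
    """Return the text of an HTML fragment without tags or citation markers."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


# Shortened version of the excerpt from the question.
excerpt = (
    '<p><b>Stack Overflow</b> is a <a href="http://en.wikipedia.org/wiki/Website" '
    'title="Website">website</a>, part of the Stack Exchange Network,'
    '<sup id="cite_ref-blog_legal_1-0" class="reference">'
    '<a href="#cite_note-blog_legal-1"><span>[</span>2<span>]</span></a></sup>'
    " featuring questions and answers.</p>"
)
print(html_to_plain_text(excerpt))
```

Using a real HTML parser rather than regular expressions keeps nested tags inside the citation markers from leaking into the output, at the cost of a little more code.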

Original text

I want to get the contents of a wikipedia article using the actual API. Now, I know full well about action=render and action=raw, but I want the most barebones version possible, in plain text. No formatting, no links, preferably no templates, no citations, and no TOC. To give an example, here's an excerpt from the SO page:

<p><b>Stack Overflow</b> is a <a href="http://en.wikipedia.org/wiki/Website" title="Website">website</a>, part of the <a href="http://en.wikipedia.org/wiki/Stack_Exchange_Network" title="Stack Exchange Network">Stack Exchange Network</a>,<sup id="cite_ref-blog_legal_1-0" class="reference"><a href="#cite_note-blog_legal-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-stackapps_legal_2-0" class="reference"><a href="#cite_note-stackapps_legal-2"><span>[</span>3<span>]</span></a></sup> featuring questions and answers on a wide range of topics in <a href="http://en.wikipedia.org/wiki/Computer_programming" title="Computer programming">computer programming</a>.<sup id="cite_ref-secrets_3-0" class="reference"><a href="#cite_note-secrets-3"><span>[</span>4<span>]</span></a></sup><sup id="cite_ref-slashdot_4-0" class="reference"><a href="#cite_note-slashdot-4"><span>[</span>5<span>]</span></a></sup><sup id="cite_ref-google-tech-talks_5-0" class="reference"><a href="#cite_note-google-tech-talks-5"><span>[</span>6<span>]</span></a></sup></p>

This is after all the templates and stuff even. I want to cut those out completely, and find where the real article starts. Then I need to shave this down further to something like: