Export text (MediaWiki markup) from a MediaWiki installation

Posted 2024-12-09 18:36:09 · Length 782 · 5 views · 0 comments

I want to export the MediaWiki markup for a number of articles (but not all articles) from a local MediaWiki installation. I want just the current article markup, not the history or anything else, with an individual text file for each article. I want to perform this export programmatically, ideally on the MediaWiki server itself rather than remotely.

For example, if I am interested in the Apple, Banana and Cupcake articles I want to be able to:

article_list = ["Apple", "Banana", "Cupcake"]
for a in article_list:
    get_article(a, a + ".txt")

My intention is to:

extract required articles
store MediaWiki markup in individual text files
parse and process in a separate program

Is this already possible with MediaWiki? It doesn't look like it. It also doesn't look like Pywikipediabot has such a script.

A fallback would be to do this manually (using the Export special page) and then parse the output into text files. Are there existing tools to do this? Is there a description of the MediaWiki XML dump format? (I couldn't find one.)

盗梦空间 2024-12-16 18:36:09

On the server side, you can just export from the database. Remotely, Pywikipediabot has a script called get.py which gets the wikicode of a given article. It is also pretty simple to do by hand, something like this (written from memory, so it may contain errors):

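import io
import wikipedia as pywikibot  # legacy Pywikipediabot module; current releases use "import pywikibot"

site = pywikibot.getSite()  # assumes you have a user-config.py with default site/user
article_list = ["Apple", "Banana", "Cupcake"]
for title in article_list:
    page = pywikibot.Page(site, title)  # the site comes first, then the title
    text = page.get()  # handling of not found etc. exceptions omitted
    # io.open takes an encoding on Python 2 and 3; the with block closes the file
    with io.open(title + ".txt", "w", encoding="utf-8") as f:
        f.write(text)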
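
If installing a bot framework is more than you want, the same loop can be sketched with the standard library alone, since MediaWiki returns a page's current wikitext when index.php is requested with action=raw. This is a minimal sketch under assumptions: INDEX_URL is a hypothetical address for the local wiki, the helper names raw_url and save_article are made up for illustration, and the wiki is assumed to be readable without logging in.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical location of the local wiki's index.php; adjust to your installation.
INDEX_URL = "http://localhost/wiki/index.php"


def raw_url(title, index_url=INDEX_URL):
    """Build the URL that asks MediaWiki for a page's current wikitext."""
    return index_url + "?" + urlencode({"title": title, "action": "raw"})


def save_article(title, filename, index_url=INDEX_URL):
    """Fetch the current markup of one page and write it to a UTF-8 text file.

    urlopen raises urllib.error.HTTPError if the request fails,
    for example when the page does not exist.
    """
    with urlopen(raw_url(title, index_url)) as response:
        text = response.read().decode("utf-8")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
```

Calling save_article(a, a + ".txt") for each title in article_list then mirrors the loop in the question. Because it only talks HTTP, it works equally well when run on the server against localhost.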

Since MediaWiki's language is not well-defined, the only reliable way to parse/process it is through MediaWiki itself; there is no support for that in Pywikipediabot, and the few tools which try to do it fail with complex templates.
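
For the fallback in the question (saving the output of the Export special page and splitting it into text files), no wikitext parsing is needed, only XML parsing. An export file is a mediawiki root element containing page elements, each with a title and one or more revision elements whose text child holds the markup. The sketch below relies on that layout and makes a few assumptions of its own: the function name split_export is made up, the XML namespace is ignored because its version suffix differs between MediaWiki releases, the last revision in a page is taken to be the current one (with "current revision only" selected there is just one), and "/" in titles is replaced by "_" to get a usable file name.

```python
import os
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child whose local tag name is `name`, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def split_export(xml_text, out_dir="."):
    """Write the last revision's markup of every page to '<title>.txt'.

    Returns the list of file paths written, in document order.
    """
    written = []
    root = ET.fromstring(xml_text)
    for page in root:
        if local(page.tag) != "page":
            continue  # skip <siteinfo> and any other non-page element
        title = child(page, "title").text
        revisions = [c for c in page if local(c.tag) == "revision"]
        if not revisions:
            continue
        text_elem = child(revisions[-1], "text")
        markup = ""
        if text_elem is not None and text_elem.text:
            markup = text_elem.text
        # Subpage titles contain "/", which cannot appear in a file name.
        path = os.path.join(out_dir, title.replace("/", "_") + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(markup)
        written.append(path)
    return written
```

To use it, read the saved export file and pass its contents in, for example split_export(open("export.xml", "rb").read(), "articles"), where both names are placeholders and the output directory must already exist.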
