Python regular expressions and wiki text

Posted 2024-10-16 05:49:34

I'm trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links.

  • [[Name of page]]
  • [[Name of page | Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache.

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

My main problem is the conflict between the [[ ]] and [[ | ]] syntaxes. I don't need a single complex regular expression; applying several (maybe two) regular expression substitutions in sequence is fine.

Please enlighten me on this problem.

Comments (4)

总攻大人 2024-10-23 05:49:35

I came up with a regex which should do the trick. Let me know if there's anything wrong with it:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick, I will never get over how ugly these things are!)

Group 1 should give you the wiki link. Group 4 should give you the link text, or None if there is no pipe.

An explanation:

  • (([^\]|]|\](?=[^\]]))*) matches a run of characters that contains neither "|" nor "]]". It does this by repeatedly matching either a character that is not "|" or "]", or a "]" that is followed by a character other than "]".
  • (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the same sub-pattern as above, to get the link text part. The sub-pattern is changed slightly so that it allows "|" characters.
  • The whole thing is wrapped in \[\[ ... \]\].
  • The (?=...) notation is a lookahead: it matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
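
As a minimal sketch of how groups 1 and 4 might be used with re.sub, assuming the displayed text should win whenever a pipe is present. The names WIKILINK_RX, link_text and strip_wikilinks are made up for illustration and are not part of the answer:

```python
import re

# The pattern from the answer above, unchanged.
WIKILINK_RX = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)


def link_text(match):
    """Return group 4 (text after the pipe) if present, else group 1 (page name)."""
    shown = match.group(4)
    return shown if shown is not None else match.group(1)


def strip_wikilinks(text):
    """Replace every wiki link in text with its visible text."""
    return WIKILINK_RX.sub(link_text, text)


sample = (
    "The CD is composed almost entirely of [[cover version]]s of "
    "[[The Beatles]] songs which George Martin "
    "[[record producer|produced]] originally."
)
print(strip_wikilinks(sample))
```

Passing a function as the replacement avoids referring to an unmatched group in a template string, since group 4 is None when the link has no pipe.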

黑寡妇 2024-10-23 05:49:34

import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):  # wrapper name is illustrative
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz

Note: you may also find some MediaWiki parsers listed at http://www.mediawiki.org/wiki/Alternative_parsers.

欢烬 2024-10-23 05:49:34

You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:

http://code.pediapress.com/wiki/wiki/mwlib
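
To illustrate the kind of edge case meant here, the sketch below runs a simple link pattern (the one suggested elsewhere in this thread) over an image link whose caption contains another link. The sample markup is made up for the demonstration:

```python
import re

# A simple wiki link pattern, as suggested elsewhere in this thread.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

# Made-up sample: an image link whose caption contains a nested link.
nested = "[[File:Example.jpg|thumb|A [[caption]] here]]"

result = wikilink_rx.sub(r'\1', nested)
print(result)  # image options and stray brackets are left in the output
```

The pattern has no notion of nesting, so the match ends at the first "]]" and the rest of the markup is left behind.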

月下凄凉 2024-10-23 05:49:34

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
# Group 1 is the optional "page|" prefix; group 2 is the text to keep.
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]', r'\2', text)
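
The second link format in the question is written with spaces around the pipe ([[Name of page | Text to display]]), and a pattern like the one above would keep the space after the pipe. A possible variant, offered only as a sketch (spaced_rx is a made-up name, and trimming whitespace is an assumption about the desired output):

```python
import re

# Variant that also drops whitespace around the displayed text.
# The lazy group ([^\]]+?) lets the trailing \s* absorb spaces before "]]".
spaced_rx = re.compile(r'\[\[(?:[^|\]]*\|)?\s*([^\]]+?)\s*\]\]')

text = "See [[Name of page | Text to display]] and [[cover version]]s."
newText = spaced_rx.sub(r'\1', text)
print(newText)
```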