Python regex and wiki text
I'm trying to change wikitext into normal text using Python regular expression substitution. There are two formatting rules for wiki links:
- [[Name of page]]
- [[Name of page | Text to display]]
(http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)
Here is some text that gives me a headache.
The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.
The text above should be changed into:
The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.
The conflict between [[ ]] and [[ | ]] grammar is my main problem. I don't need one complex regular expression. Applying multiple (maybe two) regular expression substitution(s) in sequence is ok.
Please enlighten me on this problem.
Comments (4)
I came up with a regex which should do the trick. Let me know if there's anything wrong with it:
(Ick, I will never get over how ugly these things are!)
Group 1 should give you the wiki link. Group 4 should give you the link text, or None if there is no pipe.
An explanation:
- (([^\]|]|\](?=[^\]]))*) finds all sequences of characters which do not contain "|" or "]]". It does this by repeatedly matching either a character which is not "|" or "]", or a "]" followed by a character which is not a "]".
- (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the same regex as above, to get the link text part. The regex is slightly changed in that it allows "|" characters.
- \[\[ ... \]\] matches the literal double brackets around the link.
- The (?=...) notation matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
Example: http://ideone.com/7oxuz
Note: you may also find some MediaWiki parsers in http://www.mediawiki.org/wiki/Alternative_parsers.
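
The full pattern is not shown in this answer, so the following is only a reconstruction: it assumes the complete regex is the two fragments from the explanation placed between \[\[ and \]\], which is consistent with "group 1" being the page name and "group 4" the link text. The names WIKI_LINK and strip_links are made up for this sketch.

```python
import re

# Assumed reconstruction: the fragments from the explanation, in order,
# wrapped in the literal double brackets.
WIKI_LINK = re.compile(
    r"\[\["
    r"(([^\]|]|\](?=[^\]]))*)"        # group 1: page name
    r"(\|(([^\]]|\](?=[^\]]))*))?"    # group 4: link text, None if no pipe
    r"\]\]"
)


def strip_links(text):
    """Replace each wiki link with its link text, or its page name if there is no pipe."""
    def pick(match):
        label = match.group(4)
        chosen = label if label is not None else match.group(1)
        # Drop the spaces that may surround the pipe, as in [[page | text]].
        return chosen.strip()
    return WIKI_LINK.sub(pick, text)


sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")
print(strip_links(sample))
```

Using a function as the replacement lets a single pass handle both link forms, since the choice between group 4 and group 1 is made per match.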
You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:
http://code.pediapress.com/wiki/wiki/mwlib
This should work:
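
The code for this answer did not survive on this page. The sketch below is not the answerer's code; it only illustrates the two-pass substitution the question says would be acceptable, assuming links are not nested and contain no stray brackets. The function name wikitext_to_plain is hypothetical.

```python
import re


def wikitext_to_plain(text):
    """Strip simple wiki links in two passes (assumes no nested links)."""
    # Pass 1: [[page|label]] -> label. Excluding brackets and pipes from
    # each part keeps a match from running across two separate links.
    text = re.sub(r"\[\[[^\[\]|]*\|\s*([^\[\]|]*?)\s*\]\]", r"\1", text)
    # Pass 2: [[page]] -> page. Piped links are already gone by now.
    text = re.sub(r"\[\[\s*([^\[\]|]*?)\s*\]\]", r"\1", text)
    return text


print(wikitext_to_plain(
    "The CD is composed almost entirely of [[cover version]]s of "
    "[[The Beatles]] songs which George Martin "
    "[[record producer|produced]] originally."
))
```

Running the piped form first avoids the conflict described in the question, because the second pattern never sees a "|". Templates, file links and other markup are out of scope for a sketch like this.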