Python regular expressions and wiki text

Posted 2024-10-16 05:49:34

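I am trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links:

[[Name of page]]
[[Name of page | Text to display]]
(http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache:

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

The conflict between the [[ ]] and [[ | ]] syntaxes is my main problem. I do not need one complicated regular expression. Applying several (maybe two) regular expression substitutions in sequence is fine.

Please enlighten me on this problem.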

总攻大人 2024-10-23 05:49:35

I came up with a regex which should do the trick. Let me know if there's anything wrong with it:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick, I will never get over how ugly these things are!)

Group 1 should give you the page name of the wiki link. Group 4 should give you the link text, or None if there is no pipe.

An explanation:

(([^\]|]|\](?=[^\]]))*) matches a run of characters that contains neither "|" nor "]]". It does so by repeatedly matching either a character that is not "|" or "]", or a "]" that is followed by a character other than "]".
(\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by nearly the same pattern, which captures the link text. The only difference is that this part allows "|" characters.
Obviously the whole thing is wrapped in \[\[ ... \]\].
The (?=...) notation is a lookahead: it matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
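
As a sketch of how the groups described above could drive the replacement, the snippet below passes a function to re.sub so that group 4 is used when a pipe is present and group 1 otherwise. The names WIKILINK, link_to_text, and strip_links are made up for illustration and are not part of the original answer.

```python
import re

# Pattern copied from the answer above: group 1 is the page name,
# group 4 is the display text (None when the link has no pipe).
WIKILINK = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)


def link_to_text(match):
    """Return the display text if present, otherwise the page name."""
    text = match.group(4)
    return text if text is not None else match.group(1)


def strip_links(wikitext):
    """Replace every wiki link in wikitext with its visible text."""
    return WIKILINK.sub(link_to_text, wikitext)


sample = (
    "The CD is composed almost entirely of [[cover version]]s of "
    "[[The Beatles]] songs which George Martin "
    "[[record producer|produced]] originally."
)
print(strip_links(sample))
```

Checking `is not None` rather than truthiness keeps an empty display text (as in a link that ends with a bare pipe) from falling back to the page name; whether that is the desired behavior depends on the input.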

黑寡妇 2024-10-23 05:49:34

import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):  # wrapper name is illustrative
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz

Note: you may also find some MediaWiki parsers listed at http://www.mediawiki.org/wiki/Alternative_parsers.
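
One detail worth illustrating: the question writes the piped form with spaces around the pipe, and the pattern above captures that whitespace as part of the text. The sketch below assumes the surrounding whitespace should be dropped (an assumption about the desired output, not something the answer states) and uses a function replacement to trim it. The helper name strip_wikilinks_trimmed is hypothetical.

```python
import re

# Same pattern as in the answer above.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')


def strip_wikilinks_trimmed(text):
    """Replace each wiki link with its visible text, trimming whitespace."""
    # A function replacement lets us post-process the captured group.
    return wikilink_rx.sub(lambda m: m.group(1).strip(), text)


print(strip_wikilinks_trimmed(
    "See [[Name of page | Text to display]] and [[Name of page]]."
))
```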

欢烬 2024-10-23 05:49:34

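You are going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases, and plainly broken markup that building your own regular expressions for the job is nearly impossible. Since you are using Python, I suggest mwlib, which will do the hard work for you:

http://code.pediapress.com/wiki/wiki/mwlib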

月下凄凉 2024-10-23 05:49:34

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]', r'\2', text)
print(newText)
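
The question also says that applying two substitutions in sequence would be acceptable. A minimal sketch of that alternative is shown below: the first pass rewrites piped links, the second handles the plain ones that remain. The function name strip_wikilinks_two_pass is hypothetical, and the sketch assumes link targets and text contain no "]" characters.

```python
import re


def strip_wikilinks_two_pass(text):
    """Remove wiki link markup using two sequential substitutions."""
    # Pass 1: piped links, [[page|text]] -> text
    text = re.sub(r'\[\[[^|\]]*\|([^\]]*)\]\]', r'\1', text)
    # Pass 2: plain links, [[page]] -> page
    return re.sub(r'\[\[([^|\]]*)\]\]', r'\1', text)


sample = (
    "The CD is composed almost entirely of [[cover version]]s of "
    "[[The Beatles]] songs which George Martin "
    "[[record producer|produced]] originally."
)
print(strip_wikilinks_two_pass(sample))
```

Handling the piped form first means the second pattern never has to reason about pipes, which keeps both expressions short.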
