Regular expression for matching multiple types of numbered lists
I'd like to create a (PCRE) regular expression to match all commonly used numbered lists, and I'd like to share my thoughts and gather input on ways to do this.
I've defined 'lists' as the set of canonical Anglo-Saxon conventions, i.e.
Numbers
1 2 3
1. 2. 3.
1) 2) 3)
(1) (2) (3)
1.1 1.2 1.2.1
1.1. 1.2. 1.3.
1.1) 1.2) 1.3)
(1.1) (1.2) (1.3)
Letters
a b c
a. b. c.
a) b) c)
(a) (b) (c)
A B C
A. B. C.
A) B) C)
(A) (B) (C)
Roman numerals
i ii iii
i. ii. iii.
i) ii) iii)
(i) (ii) (iii)
I II III
I. II. III.
I) II) III)
(I) (II) (III)
I'd like to know how complete this set of list conventions is, whether there are other numbering conventions that should be included, and whether any of these ought to be removed.
Here's a regular expression I've created to solve this problem (in Python):
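import re

# Adjacent raw strings inside parentheses are concatenated, so each
# fragment can carry its own comment without a line-continuation backslash.
numex = (
    r'(?:\d{1,3}'            # 1, 2, 3
    r'(?:\.\d{1,3}){0,4}'    # 1.1, 1.1.1.1
    r'|[A-Z]{1,2}'           # A. B. C.
    r'|[ivxcl]{1,6})'        # i, iii, ... (closes the outer group)
)
# The template has two %s placeholders, so numex must be supplied twice.
rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)  # re.U?
rex.match("123. Some paragraph")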
I'd like to know how adequate this regex is for this problem, and whether there are alternative solutions (regex or otherwise).
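One rough way to gauge adequacy is to run the pattern against a sample marker for each convention listed above. The sketch below assumes the pattern fragments from the question with the outer group closed and both placeholders filled in; the `samples` list is illustrative, not exhaustive.

```python
import re

# Pattern fragments from the question (group closed, both %s filled in).
numex = (
    r'(?:\d{1,3}'
    r'(?:\.\d{1,3}){0,4}'
    r'|[A-Z]{1,2}'
    r'|[ivxcl]{1,6})'
)
rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)

# One illustrative marker per convention in the list above.
samples = [
    "1", "1.", "1)", "(1)", "1.2.1", "1.2.", "1.2)", "(1.2)",
    "a", "a.", "a)", "(a)", "A", "A.", "A)", "(A)",
    "iii", "iii.", "iii)", "(iii)", "III", "III.", "III)", "(III)",
]

# fullmatch() requires the whole marker to be consumed, which is stricter
# than the match() call used in the question.
unmatched = [s for s in samples if rex.fullmatch(s) is None]
print(unmatched)
```

An empty list here only says that the listed marker shapes are accepted; it says nothing about false positives in running text.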
Incidentally, for my particular use-case, I wouldn't expect list numbers of more than 25-50.
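That range matters for the Roman-numeral branch, which allows at most six letters. A small sketch, assuming standard subtractive notation and a hypothetical `to_roman` helper that only covers 1 to 50, finds the longest numeral in that range:

```python
def to_roman(n):
    """Lowercase Roman numeral for 1 <= n <= 50 (hypothetical helper)."""
    table = [(50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in table:
        # Greedily subtract the largest value that still fits.
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

numerals = {n: to_roman(n) for n in range(1, 51)}
longest = max(numerals.values(), key=len)
print(longest, len(longest))
```

The longest numeral up to 50 is the one for 38, which needs seven letters, so `[ivxcl]{1,6}` cannot consume it in full.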
Thank you for reading.
Brian
Here's a Wikified solution: additions, changes and suggestions most welcome.
I'd change at least one thing, and that is to add word boundary anchors around your regex; otherwise it will match every single letter in any text:
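A minimal sketch of the word-boundary idea, assuming the pattern fragments from the question (with the outer group closed). The exact placement of `\b` is an assumption: here it wraps only the numeral, so that a leading `(` or a trailing `)` or `.` stays outside the boundary check.

```python
import re

numex = (
    r'(?:\d{1,3}'
    r'(?:\.\d{1,3}){0,4}'
    r'|[A-Z]{1,2}'
    r'|[ivxcl]{1,6})'
)
# Shape from the question: no word boundaries.
plain = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)
# Assumed placement: \b directly around the numeral in both alternatives.
bounded = re.compile(r'(\(?\b%s\b\)|\b%s\b\.?)' % (numex, numex), re.I)

text = "Something went wrong"
print(plain.search(text))    # matches the first two letters of "Something"
print(bounded.search(text))  # no match: every word is longer than two letters
```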
This helps a little, but of course any one- or two-letter word will still be matched.
You might want to anchor the search at the start of the line; after all, these characters should be the first thing on the line (except maybe whitespace). A negative lookbehind won't work in Python because Python doesn't support variable-length lookbehind, so you could add this outside the matching parentheses:
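A sketch of that anchoring, again assuming the question's pattern fragments. The `^\s*` prefix sits outside the capturing parentheses; the `re.M` flag (so `^` matches at every line start) and the sample text are assumptions made for illustration.

```python
import re

numex = (
    r'(?:\d{1,3}'
    r'(?:\.\d{1,3}){0,4}'
    r'|[A-Z]{1,2}'
    r'|[ivxcl]{1,6})'
)
# ^\s* consumes leading whitespace but is not part of group 1.
anchored = re.compile(
    r'^\s*(\(?\b%s\b\)|\b%s\b\.?)' % (numex, numex),
    re.I | re.M,
)

text = (
    "Intro paragraph mentions item 2) in passing.\n"
    "  1.2) First nested item\n"
    "(b) Another item"
)
# group(1) holds only the marker; group(0) would include the indentation.
markers = [m.group(1) for m in anchored.finditer(text)]
print(markers)
```

The `2)` in the middle of the first line is skipped because it does not start a line.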
Of course, now you must look at the match object's group(1) to get only the actual match and not the leading whitespace.
You will still match too much (e.g. sentences starting with "I thought so" or "It was a dark and stormy night"), but your rules allow this, and I think you're aware of it.