Regular expression to match multiple types of numbered lists

Posted on 2024-09-07 04:40:11

I'd like to create a (PCRE) regular expression to match all commonly used numbered lists, and I'd like to share my thoughts and gather input on ways to do this.

I've defined 'lists' as the set of canonical Anglo-Saxon conventions, i.e.

Numbers

1 2 3
1. 2. 3.
1) 2) 3)
(1) (2) (3)
1.1 1.2 1.2.1
1.1. 1.2. 1.3.
1.1) 1.2) 1.3)
(1.1) (1.2) (1.3)

Letters

a b c
a. b. c.
a) b) c)
(a) (b) (c) 
A B C
A. B. C. 
A) B) C)
(A) (B) (C)

Roman numerals

i ii iii
i. ii. iii.
i) ii) iii)
(i) (ii) (iii)
I II III
I. II. III.
I) II) III)
(I) (II) (III)

I'd like to know how complete this set of list conventions is, whether there are other numbering conventions that should be in there, and whether any of these ought to be removed.

Here's a regular expression I've created to solve this problem (in Python):

import re

numex = (
    r'(?:\d{1,3}'            # 1, 2, 3
    r'(?:\.\d{1,3}){0,4}'    # 1.1, 1.1.1.1
    r'|[A-Z]{1,2}'           # A. B. C.
    r'|[ivxcl]{1,6})'        # i, iii, ...
)

# Both %s placeholders need a value, so numex is passed twice.
rex = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)  # re.U?

rex.match("123. Some paragraph")

I'd like to know how adequate this regex is for this problem, and whether there are alternative solutions (regex or otherwise).

Incidentally, for my particular use-case, I wouldn't expect list numbers of more than 25-50.

Thank you for reading.

Brian

魂归处 2024-09-14 04:40:11

Here's a Wikified solution:

import re

# No leading "^" here: the fragment is embedded after "\(?" below,
# and the outer pattern already anchors at the start.
numex = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # B - Z except I; a single capital caps at 26.
    | [AI](?!\s)              # Note: "A" and "I" can properly start non-lists
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """

rex = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)'
                 % (numex, numex), re.X)
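
A quick way to exercise a pattern like this is to wrap it in a small helper and feed it a few lines. The sketch below is only an illustration: `split_marker` is a hypothetical helper name, and the fragment is assumed to carry no leading `^`, so that it can follow `\(?` inside the outer pattern.

```python
import re

# Assumption: same alternatives as the fragment above, without a leading "^".
NUMEX = r"""(?:
      \d{1,3}                 # 1, 2, 3
          (?:\.\d{1,3}){0,4}  # 1.1, 1.1.1.1
    | [B-H] | [J-Z]           # capitals other than A and I
    | [AI](?!\s)              # "A" / "I" only when not followed by whitespace
    | [a-z]                   # a - z
    | [ivxcl]{1,6}            # Roman ii, etc.
    | [IVXCL]{1,6}            # Roman IV, etc.
    )
    """

REX = re.compile(r'^\s*(\(?%s\)|%s\.?)\s+(.*)' % (NUMEX, NUMEX), re.X)


def split_marker(line):
    """Return (marker, rest) if the line starts with a list marker, else None."""
    m = REX.match(line)
    return (m.group(1), m.group(2)) if m else None


for sample in ["1. First item", "(iv) Fourth", "  1.2.1) Nested",
               "A dark night", "I thought so", "Hello world"]:
    print(repr(sample), "->", split_marker(sample))
```

Note that "A dark night" is rejected by the `[AI](?!\s)` guard, but "I thought so" is still accepted, because a lone "I" also satisfies the `[IVXCL]{1,6}` alternative.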

Additions, changes and suggestions most welcome.

野の 2024-09-14 04:40:11

I'd change at least one thing, and that is to add word boundary anchors around your regex, otherwise it will match every single letter in any text:

rex = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I|re.M)

This helps a little, but of course any one- or two-letter word will still be matched.
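
To see the difference, here is a small sketch comparing the two patterns on ordinary prose. It assumes the question's `numex` with its outer group closed; the names `unanchored` and `bounded` are just labels for this example.

```python
import re

# Assumption: the question's fragment, with the outer (?: ... ) group closed.
numex = (
    r'(?:\d{1,3}'
    r'(?:\.\d{1,3}){0,4}'
    r'|[A-Z]{1,2}'
    r'|[ivxcl]{1,6})'
)

unanchored = re.compile(r'(\(?%s\)|%s\.?)' % (numex, numex), re.I)
bounded = re.compile(r'(\(?\b%s\)|\b%s\b\.?)' % (numex, numex), re.I | re.M)

text = "It is dark"
print(unanchored.findall(text))  # every pair of letters is a candidate
print(bounded.findall(text))     # only whole one- or two-letter words remain
```

Without the boundaries the word "dark" is chopped into "da" and "rk"; with them only "It" and "is" are reported, which is exactly the remaining false-positive class.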

You might want to anchor the search at the start of the line; after all, these characters should be the first thing on the line (except maybe whitespace). A negative lookbehind won't work in Python because Python doesn't support variable-length lookbehind, so you could add this outside the matching parentheses:

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I|re.M)

Of course, now you must look at the match object's group(1) to only get the actual match and not the leading whitespace.
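
As a rough illustration of reading `group(1)`, the sketch below scans a multi-line string with the anchored pattern. It again assumes the question's `numex` with its group closed; the sample text is made up.

```python
import re

# Assumption: the question's fragment, with the outer (?: ... ) group closed.
numex = (
    r'(?:\d{1,3}'
    r'(?:\.\d{1,3}){0,4}'
    r'|[A-Z]{1,2}'
    r'|[ivxcl]{1,6})'
)

rex = re.compile(r'^\s*(\(?%s\)|%s\b\.?)' % (numex, numex), re.I | re.M)

text = "  1. First\n  (b) Second\nPlain text\n    iv) Fourth"

for m in rex.finditer(text):
    # group(0) includes the leading whitespace, group(1) is the marker only.
    print(repr(m.group(0)), "->", repr(m.group(1)))

markers = [m.group(1) for m in rex.finditer(text)]
```

With `re.M`, `^` matches at the start of every line, so `finditer` visits each line's marker in turn; the "Plain text" line is skipped because "Pl" is not followed by a word boundary.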

You will still match too much (e.g. sentences starting with "I thought so" or "It was a dark and stormy night"), but your rules allow this, and I think you're aware of it.
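
One possible way to cut down on those false positives, which is not something proposed in this thread, is to let only digits stand alone and require a closing "." or ")" after letter and Roman-numeral markers. The sketch below is a hypothetical variant (`STRICT` and `marker` are made-up names), and it deliberately gives up the bare `a b c` and `i ii iii` conventions from the question.

```python
import re

# Hypothetical stricter variant: digits may stand alone, but letter and
# Roman-numeral markers must end in "." or ")" (optionally opened by "(").
STRICT = re.compile(
    r'''^\s*(
          \(?(?:\d{1,3}(?:\.\d{1,3}){0,4}|[a-z]{1,2}|[ivxcl]{1,6})\)   # 1)  (a)  iv)
        | (?:\d{1,3}(?:\.\d{1,3}){0,4})\.?                             # 1  1.  1.2.
        | (?:[a-z]{1,2}|[ivxcl]{1,6})\.                                # a.  iv.
        )\s+''',
    re.I | re.X,
)


def marker(line):
    """Return the list marker at the start of the line, or None."""
    m = STRICT.match(line)
    return m.group(1) if m else None


for sample in ["1. First", "(iv) Fourth", "a. item", "3 apples",
               "I thought so", "It was a dark and stormy night"]:
    print(repr(sample), "->", marker(sample))
```

This rejects both example sentences, but it is still only a heuristic: an abbreviation such as "No." at the start of a line would be taken for a marker.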
