How do I match nested LaTeX macros with re in Python?

Posted on 2025-01-19 14:34:22

I want to match LaTeX macros correctly, including nested ones. Consider the following:

s = r'''
firstline
\lr{secondline\rl{ right-to-left
        \lr{nested left-to-right} end RTL }
        other text
}
\rl{ last \lr{end line 
} end RTL }
'''

For instance, in the string above, I want to match each \lr macro together with its content. I have tried the following, but neither attempt works correctly:

re.findall(r'(?:\\lr\{.*\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}\n\\rl{ last \\lr{end line \n} end RTL }']

Even the non-greedy version does not work in this case:

re.findall(r'(?:\\lr\{.*?\})', s, re.DOTALL)
['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right}',
 '\\lr{end line \n}']

I need a regular expression that matches this correctly, much like matching nested parentheses; here the nesting is in the curly braces of the LaTeX macros.

Edit:

I'd like to get the following matches:

['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 
'\\lr{nested left-to-right}',
'\\lr{end line \n}']

It would be perfect if I could also get the nesting level of each match, something like this:

[('\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', 1),
 ('\\lr{nested left-to-right}', 2),
 ('\\lr{end line \n}', 1)]

压抑⊿情绪 2025-01-26 14:34:23

With the PyPI regex module (install it with pip install regex), you can use:

import regex

s = r'''
firstline
\lr{secondline\rl{ right-to-left
        \lr{nested left-to-right} end RTL }
        other text
}
\rl{ last \lr{end line 
} end RTL }
'''

print( [x.group() for x in regex.finditer(r'\\lr(\{(?:[^{}]++|(?1))*})', s, overlapped=True)] )
# => ['\\lr{secondline\\rl{ right-to-left\n        \\lr{nested left-to-right} end RTL }\n        other text\n}', '\\lr{nested left-to-right}', '\\lr{end line \n}']

Note the overlapped=True option passed to regex.finditer: it lets the search resume one character after the start of the previous match, so occurrences nested inside an earlier match are found too.

Details:

\\lr - \lr string
(\{(?:[^{}]++|(?1))*}) - Group 1 (defined to be referred to while recursing):
- \{ - a { char
- (?:[^{}]++|(?1))* - zero or more repetitions of
- [^{}]++ - one or more chars other than { and }, matched possessively (i.e. without the possibility of re-matching the text if backtracking is triggered)
- | - or
- (?1) - Group 1 pattern recursed
- } - a } char.
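
The pattern above returns the matched macros but not their nesting level, which the question also asked for. As a rough sketch only, and not part of the original answer, a single pass with a brace stack can report both using just the standard library. The function name find_macros, its parameters, and the sample string are hypothetical. It assumes that "level" means the number of enclosing macros of the same name plus one (which is what the expected output in the question suggests), and that a backslash followed by any character other than the macro name is an escape such as \{ or \}.

```python
def find_macros(text, name="lr"):
    """Return (macro_text, level) pairs for every balanced \\name{...} in text.

    Hypothetical helper: level counts only enclosing macros called `name`,
    so \\lr inside \\rl inside \\lr is level 2.
    """
    opener = "\\" + name + "{"
    stack = []   # one entry per open brace: (start, level) for our macro, else None
    depth = 0    # how many `name` macros are currently open
    found = []
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1
            stack.append((i, depth))
            i += len(opener)
        elif text[i] == "\\":
            i += 2  # skip the backslash and the next char (covers \{ \} \\)
        elif text[i] == "{":
            stack.append(None)  # a brace that belongs to some other macro or group
            i += 1
        elif text[i] == "}":
            if stack:
                entry = stack.pop()
                if entry is not None:
                    start, level = entry
                    found.append((start, text[start:i + 1], level))
                    depth -= 1
            i += 1
        else:
            i += 1
    found.sort()  # macros close inside-out; restore document order by start index
    return [(macro, level) for _, macro, level in found]


sample = r"\lr{a \rl{b \lr{c} d} e} \rl{f \lr{g}}"
for macro, level in find_macros(sample):
    print(level, macro)
```

Macros whose braces are never closed are silently skipped in this sketch; a stricter version could raise an error when the stack is not empty at the end of the text.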