Python:如何将嵌套括号与正则表达式匹配?
我正在尝试匹配一个类似数学表达式的字符串,该字符串具有嵌套的括号。
import re
p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)
['(((1+0)+1)+1)']
我希望它匹配所有包含的表达式,例如 (1+0)、((1+0)+1)...
我什至不在乎它是否与不需要的匹配(((1+0)),我可以处理这些。
为什么它还没有这样做,我该怎么做?
I'm trying to match a mathematical-expression-like string, that have nested parentheses.
import re
p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)
['(((1+0)+1)+1)']
I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.
Why it's not doing that already, and how can I do it?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(12)
正如其他人提到的,正则表达式不是嵌套构造的方法。我将使用 pyparsing 给出一个基本示例:
这是一个使用示例:
输出:
(手动完成换行/缩进/注释)以
嵌套列表格式获取输出:
输出:
As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:
Here's a usage example:
Output:
(With newlining/indenting/comments done manually)
To get the output in nested list format:
Output:
一种新的常规引擎模块正准备替换 Python 中现有的模块。它引入了许多新功能,包括递归调用。
输出:
正则表达式
中的相关错误:http ://code.google.com/p/mrab-regex-hg/issues/detail?id=78There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.
Output:
Related bug in
regex
: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78正则表达式语言的功能不足以匹配任意嵌套的结构。为此,您需要一个下推自动机(即解析器)。有多种此类工具可用,例如 PLY。
Python 还为其自己的语法提供了一个解析器库,它可能会满足您的需要。然而,输出非常详细,需要一段时间才能理解。如果您对这个角度感兴趣,下面的讨论将尝试尽可能简单地解释事情。
您可以使用这个简短的函数来减轻痛苦:
数字来自 Python 模块
symbol
和token
,您可以使用它们构建从数字到名称的查找表:甚至可以将此映射折叠到shallow()函数中,这样您就可以使用字符串而不是数字:
Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.
Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.
You can ease the pain with this short function:
The numbers come from the Python modules
symbol
andtoken
, which you can use to build a lookup table from numbers to names:You could even fold this mapping into the
shallow()
function so you can work with strings instead of numbers:正则表达式尝试匹配尽可能多的文本,从而消耗掉所有字符串。它不会在该字符串的某些部分上查找正则表达式的其他匹配项。这就是为什么你只能得到一个答案。
解决方案是不使用正则表达式。如果您实际上正在尝试解析数学表达式,请使用真正的解析解决方案。如果您确实只想捕获括号内的部分,只需在看到 ( 和 ) 时循环遍历字符计数并递增计数器的减量即可。
The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.
The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.
堆栈是完成这项工作的最佳工具: -
在客户端代码中,由于函数被编写为生成器函数,只需使用 for 循环模式来展开匹配: -
此测试代码在我的屏幕上生成以下内容,注意到第二个参数打印输出指示括号的深度。
Stack is the best tool for the job: -
In the client code, since the function is written as a generator function simply use the for loop pattern to unroll the matches: -
This test code produces following on my screen, noticed the second param in the printout indicates the depth of the parenthesis.
来自链接的答案:
来自LilyPond的convert-ly实用程序(由我自己编写/受版权保护,所以我可以在这里展示它):
convert-ly倾向于在其正则表达式中使用它作为paren_matcher(25),这对于大多数应用程序。但随后它使用它来匹配Scheme 表达式。
是的,在给定的限制之后它就会崩溃,但是将其插入正则表达式的能力仍然击败了在可用性方面支持无限深度的“正确”替代方案。
From a linked answer:
From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):
convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.
Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the "correct" alternatives supporting unlimited depth hands-down in usability.
我相信这个功能可能适合您的需要,我把它快速地组合在一起,所以请随意清理它。当做嵌套时,很容易向后思考并从那里开始工作=]
I believe this function may suit your need, I threw this together fast so feel free to clean it up a bit. When doing nests its easy to think of it backwards and work from there =]
平衡对(例如括号)是正则表达式无法识别的语言的一个示例。
以下是对其原因的数学简要解释。
正则表达式是一种定义有限状态自动机(缩写为 FSM)的方法。这样的设备具有有限数量的可能状态来存储信息。如何使用该状态并没有特别限制,但这确实意味着它可以识别的不同位置的绝对最大数量。
例如,状态可用于计算不匹配的左括号等。但由于这种计数的状态量必须完全有界,因此给定的 FSM 最多可以计数到 n-1,其中 n 是表示 FSM 可以位于其中。如果 n 是 10,那么 FSM 可以匹配的不匹配左括号的最大数量是 10,直到它中断。由于完全有可能再有一个左括号,因此不可能有 FSM 能够正确识别匹配括号的完整语言。
所以呢?假设您只是选择一个非常大的n?问题在于,作为描述 FSM 的一种方式,正则表达式基本上描述了从一种状态到另一种状态的所有转换。 表达式本身必须至少增长 n 的常量因子倍数
由于对于任意 N,FSM 需要 2 次状态转换(一次用于匹配左括号,一次用于匹配右括号),因此正则 相比之下,下一类更好的语言(上下文无关语法)可以以完全紧凑的方式解决这个问题。这是 BNF 中的示例
Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.
What follows is a brief explanation of the math for why that is.
Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.
For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.
So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n
By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF
您可以使用正则表达式,但需要自己进行递归。像下面这样的东西就可以解决问题(如果您只需要查找,如您的问题所述,括号中包含的所有表达式):
但是,此代码与“正确”括号不匹配。如果您需要这样做,那么使用专门的解析器会更好。
You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):
This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.
这是针对您的问题的演示,尽管它很笨拙,但它有效
Here is a demo for your question, though it is clumsy, while it works
我的解决方案是:定义一个函数来提取最外面括号内的内容,然后重复调用该函数,直到获取最里面括号内的内容。
my solution is that: define a function to extract content within the outermost parentheses, and then you call that function repeatedly until you get the content within the innermost parentheses.
许多帖子建议对于嵌套大括号,
正则表达式不是这样做的方法。简单地数一下大括号:
例如,请参阅:用于检测半的正则表达式-冒号终止 C++ 的 & while 循环
这是一个完整的 Python 示例,用于迭代字符串并计算大括号的数量:
Many posts suggest that for nested braces,
REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES:
For example, see: Regular expression to detect semi-colon terminated C++ for & while loops
Here is a complete python sample to iterate through a string and count braces: