Python:如何将嵌套括号与正则表达式匹配?

发布于 2024-10-28 03:15:17 字数 307 浏览 1 评论 0原文

我正在尝试匹配一个类似数学表达式的字符串,该字符串具有嵌套的括号。

import re

p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)

['(((1+0)+1)+1)']

我希望它匹配所有包含的表达式,例如 (1+0)、((1+0)+1)...
我什至不在乎它是否与不需要的匹配(((1+0)),我可以处理这些。

为什么它还没有这样做,我该怎么做?

I'm trying to match a mathematical-expression-like string, that have nested parentheses.

import re

p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)

['(((1+0)+1)+1)']

I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.

Why it's not doing that already, and how can I do it?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

风吹短裙飘 2024-11-04 03:15:17

正如其他人提到的,正则表达式不是嵌套构造的方法。我将使用 pyparsing 给出一个基本示例:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens     = pyparsing.nestedExpr( '(', ')', content=thecontent)

这是一个使用示例:

>>> parens.parseString("((a + b) + c)")

输出:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     ['a', '+', 'b'], {}   
    ),                     #  (a + b)      [closed]
    '+',
    'c'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(手动完成换行/缩进/注释)以

嵌套列表格式获取输出:

res = parens.parseString("((12 + 2) + 3)")
res.asList()

输出:

[[['12', '+', '2'], '+', '3']]

As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens     = pyparsing.nestedExpr( '(', ')', content=thecontent)

Here's a usage example:

>>> parens.parseString("((a + b) + c)")

Output:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     ['a', '+', 'b'], {}   
    ),                     #  (a + b)      [closed]
    '+',
    'c'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(With newlining/indenting/comments done manually)

To get the output in nested list format:

res = parens.parseString("((12 + 2) + 3)")
res.asList()

Output:

[[['12', '+', '2'], '+', '3']]
淤浪 2024-11-04 03:15:17

一种新的常规引擎模块正准备替换 Python 中现有的模块。它引入了许多新功能,包括递归调用。

import regex

s = 'aaa(((1+0)+1)+1)bbb'

result = regex.search(r'''
(?<rec> #capturing group rec
 \( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anyting but parenthesis one or more times without backtracking
  | #or
   (?&rec) #recursive substitute of group rec
 )*
 \) #close parenthesis
)
''',s,flags=regex.VERBOSE)


print(result.captures('rec'))

输出:

['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']

正则表达式中的相关错误:http ://code.google.com/p/mrab-regex-hg/issues/detail?id=78

There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.

import regex

s = 'aaa(((1+0)+1)+1)bbb'

result = regex.search(r'''
(?<rec> #capturing group rec
 \( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anyting but parenthesis one or more times without backtracking
  | #or
   (?&rec) #recursive substitute of group rec
 )*
 \) #close parenthesis
)
''',s,flags=regex.VERBOSE)


print(result.captures('rec'))

Output:

['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']

Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78

墨小墨 2024-11-04 03:15:17

正则表达式语言的功能不足以匹配任意嵌套的结构。为此,您需要一个下推自动机(即解析器)。有多种此类工具可用,例如 PLY

Python 还为其自己的语法提供了一个解析器库,它可能会满足您的需要。然而,输出非常详细,需要一段时间才能理解。如果您对这个角度感兴趣,下面的讨论将尝试尽可能简单地解释事情。

>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
 [327,
  [304,
   [305,
    [306,
     [307,
      [308,
       [310,
        [311,
         [312,
          [313,
           [314,
            [315,
             [316,
              [317,
               [318,
                [7, '('],
                [320,
                 [304,
                  [305,
                   [306,
                    [307,
                     [308,
                      [310,
                       [311,
                        [312,
                         [313,
                          [314,
                           [315,
                            [316,
                             [317,
                              [318,
                               [7, '('],
                               [320,
                                [304,
                                 [305,
                                  [306,
                                   [307,
                                    [308,
                                     [310,
                                      [311,
                                       [312,
                                        [313,
                                         [314,
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [7,
                                               '('],
                                              [320,
                                               [304,
                                                [305,
                                                 [306,
                                                  [307,
                                                   [308,
                                                    [310,
                                                     [311,
                                                      [312,
                                                       [313,
                                                        [314,
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '1']]]]],
                                                         [14,
                                                          '+'],
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '0']]]]]]]]]]]]]]]],
                                              [8,
                                               ')']]]]],
                                          [14,
                                           '+'],
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [2,
                                               '1']]]]]]]]]]]]]]]],
                               [8, ')']]]]],
                           [14, '+'],
                           [315,
                            [316,
                             [317,
                              [318, [2, '1']]]]]]]]]]]]]]]],
                [8, ')']]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

您可以使用这个简短的函数来减轻痛苦:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
 [318,
  '(',
  [314,
   [318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
   '+',
   '1'],
  ')'],
 '',
 '']

数字来自 Python 模块 symboltoken,您可以使用它们构建从数字到名称的查找表

map = dict(token.tok_name.items() + symbol.sym_name.items())

:甚至可以将此映射折叠到shallow()函数中,这样您就可以使用字符串而不是数字:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
 ['atom',
  '(',
  ['arith_expr',
   ['atom',
    '(',
    ['arith_expr',
     ['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
     '+',
     '1'],
    ')'],
   '+',
   '1'],
  ')'],
 '',
 '']

Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.

Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.

>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
 [327,
  [304,
   [305,
    [306,
     [307,
      [308,
       [310,
        [311,
         [312,
          [313,
           [314,
            [315,
             [316,
              [317,
               [318,
                [7, '('],
                [320,
                 [304,
                  [305,
                   [306,
                    [307,
                     [308,
                      [310,
                       [311,
                        [312,
                         [313,
                          [314,
                           [315,
                            [316,
                             [317,
                              [318,
                               [7, '('],
                               [320,
                                [304,
                                 [305,
                                  [306,
                                   [307,
                                    [308,
                                     [310,
                                      [311,
                                       [312,
                                        [313,
                                         [314,
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [7,
                                               '('],
                                              [320,
                                               [304,
                                                [305,
                                                 [306,
                                                  [307,
                                                   [308,
                                                    [310,
                                                     [311,
                                                      [312,
                                                       [313,
                                                        [314,
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '1']]]]],
                                                         [14,
                                                          '+'],
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '0']]]]]]]]]]]]]]]],
                                              [8,
                                               ')']]]]],
                                          [14,
                                           '+'],
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [2,
                                               '1']]]]]]]]]]]]]]]],
                               [8, ')']]]]],
                           [14, '+'],
                           [315,
                            [316,
                             [317,
                              [318, [2, '1']]]]]]]]]]]]]]]],
                [8, ')']]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

You can ease the pain with this short function:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
 [318,
  '(',
  [314,
   [318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
   '+',
   '1'],
  ')'],
 '',
 '']

The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:

map = dict(token.tok_name.items() + symbol.sym_name.items())

You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
 ['atom',
  '(',
  ['arith_expr',
   ['atom',
    '(',
    ['arith_expr',
     ['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
     '+',
     '1'],
    ')'],
   '+',
   '1'],
  ')'],
 '',
 '']
ヅ她的身影、若隐若现 2024-11-04 03:15:17

正则表达式尝试匹配尽可能多的文本,从而消耗掉所有字符串。它不会在该字符串的某些部分上查找正则表达式的其他匹配项。这就是为什么你只能得到一个答案。

解决方案是不使用正则表达式。如果您实际上正在尝试解析数学表达式,请使用真正的解析解决方案。如果您确实只想捕获括号内的部分,只需在看到 ( 和 ) 时循环遍历字符计数并递增计数器的减量即可。

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.

乱世争霸 2024-11-04 03:15:17

堆栈是完成这项工作的最佳工具: -

import re
def matches(line, opendelim='(', closedelim=')'):
    stack = []

    for m in re.finditer(r'[{}{}]'.format(opendelim, closedelim), line):
        pos = m.start()

        if line[pos-1] == '\\':
            # skip escape sequence
            continue

        c = line[pos]

        if c == opendelim:
            stack.append(pos+1)

        elif c == closedelim:
            if len(stack) > 0:
                prevpos = stack.pop()
                # print("matched", prevpos, pos, line[prevpos:pos])
                yield (prevpos, pos, len(stack))
            else:
                # error
                print("encountered extraneous closing quote at pos {}: '{}'".format(pos, line[pos:] ))
                pass

    if len(stack) > 0:
        for pos in stack:
            print("expecting closing quote to match open quote starting at: '{}'"
                  .format(line[pos-1:]))

在客户端代码中,由于函数被编写为生成器函数,只需使用 for 循环模式来展开匹配: -

line = '(((1+0)+1)+1)'
for openpos, closepos, level in matches(line):
    print(line[openpos:closepos], level)

此测试代码在我的屏幕上生成以下内容,注意到第二个参数打印输出指示括号的深度

1+0 2
(1+0)+1 1
((1+0)+1)+1 0

Stack is the best tool for the job: -

import re
def matches(line, opendelim='(', closedelim=')'):
    stack = []

    for m in re.finditer(r'[{}{}]'.format(opendelim, closedelim), line):
        pos = m.start()

        if line[pos-1] == '\\':
            # skip escape sequence
            continue

        c = line[pos]

        if c == opendelim:
            stack.append(pos+1)

        elif c == closedelim:
            if len(stack) > 0:
                prevpos = stack.pop()
                # print("matched", prevpos, pos, line[prevpos:pos])
                yield (prevpos, pos, len(stack))
            else:
                # error
                print("encountered extraneous closing quote at pos {}: '{}'".format(pos, line[pos:] ))
                pass

    if len(stack) > 0:
        for pos in stack:
            print("expecting closing quote to match open quote starting at: '{}'"
                  .format(line[pos-1:]))

In the client code, since the function is written as a generator function simply use the for loop pattern to unroll the matches: -

line = '(((1+0)+1)+1)'
for openpos, closepos, level in matches(line):
    print(line[openpos:closepos], level)

This test code produces following on my screen, noticed the second param in the printout indicates the depth of the parenthesis.

1+0 2
(1+0)+1 1
((1+0)+1)+1 0
冷夜 2024-11-04 03:15:17

来自链接的答案:

来自LilyPond的convert-ly实用程序(由我自己编写/受版权保护,所以我可以在这里展示它):

def paren_matcher (n):
    # poor man's matched paren scanning, gives up
    # after n+1 levels.  Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n

convert-ly倾向于在其正则表达式中使用它作为paren_matcher(25),这对于大多数应用程序。但随后它使用它来匹配Scheme 表达式。

是的,在给定的限制之后它就会崩溃,但是将其插入正则表达式的能力仍然击败了在可用性方面支持无限深度的“正确”替代方案。

From a linked answer:

From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):

def paren_matcher (n):
    # poor man's matched paren scanning, gives up
    # after n+1 levels.  Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n

convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.

Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the "correct" alternatives supporting unlimited depth hands-down in usability.

等风也等你 2024-11-04 03:15:17

我相信这个功能可能适合您的需要,我把它快速地组合在一起,所以请随意清理它。当做嵌套时,很容易向后思考并从那里开始工作=]

def fn(string,endparens=False):
    exp = []
    idx = -1
    for char in string:
        if char == "(":
            idx += 1
            exp.append("")
        elif char == ")":
            idx -= 1
            if idx != -1:
                exp[idx] = "(" + exp[idx+1] + ")"
        else:
            exp[idx] += char
    if endparens:
        exp = ["("+val+")" for val in exp]
    return exp

I believe this function may suit your need, I threw this together fast so feel free to clean it up a bit. When doing nests its easy to think of it backwards and work from there =]

def fn(string,endparens=False):
    exp = []
    idx = -1
    for char in string:
        if char == "(":
            idx += 1
            exp.append("")
        elif char == ")":
            idx -= 1
            if idx != -1:
                exp[idx] = "(" + exp[idx+1] + ")"
        else:
            exp[idx] += char
    if endparens:
        exp = ["("+val+")" for val in exp]
    return exp
溺ぐ爱和你が 2024-11-04 03:15:17

平衡对(例如括号)是正则表达式无法识别的语言的一个示例。

以下是对其原因的数学简要解释。

正则表达式是一种定义有限状态自动机(缩写为 FSM)的方法。这样的设备具有有限数量的可能状态来存储信息。如何使用该状态并没有特别限制,但这确实意味着它可以识别的不同位置的绝对最大数量。

例如,状态可用于计算不匹配的左括号等。但由于这种计数的状态量必须完全有界,因此给定的 FSM 最多可以计数到 n-1,其中 n 是表示 FSM 可以位于其中。如果 n 是 10,那么 FSM 可以匹配的不匹配左括号的最大数量是 10,直到它中断。由于完全有可能再有一个左括号,因此不可能有 FSM 能够正确识别匹配括号的完整语言。

所以呢?假设您只是选择一个非常大的n?问题在于,作为描述 FSM 的一种方式,正则表达式基本上描述了从一种状态到另一种状态的所有转换。 表达式本身必须至少增长 n 的常量因子倍数

由于对于任意 N,FSM 需要 2 次状态转换(一次用于匹配左括号,一次用于匹配右括号),因此正则 相比之下,下一类更好的语言(上下文无关语法)可以以完全紧凑的方式解决这个问题。这是 BNF 中的示例

<代码>
表达式 ::= `(` 表达式 `)` 表达式
        |    什么都没有
 

Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.

What follows is a brief explanation of the math for why that is.

Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.

For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.

So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n

By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF


expression ::= `(` expression `)` expression
           |    nothing
 

独留℉清风醉 2024-11-04 03:15:17

您可以使用正则表达式,但需要自己进行递归。像下面这样的东西就可以解决问题(如果您只需要查找,如您的问题所述,括号中包含的所有表达式):

import re

def scan(p, string):
    found = p.findall(string)
    for substring in found:
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found

p = re.compile('\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print all_found

但是,此代码与“正确”括号不匹配。如果您需要这样做,那么使用专门的解析器会更好。

You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):

import re

def scan(p, string):
    found = p.findall(string)
    for substring in found:
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found

p = re.compile('\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print all_found

This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.

唠甜嗑 2024-11-04 03:15:17

这是针对您的问题的演示,尽管它很笨拙,但它有效

import re s = '(((1+0)+1)+1)'

def getContectWithinBraces( x , *args , **kwargs):
    ptn = r'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]' %kwargs
    Res = []
    res = re.findall(ptn , x)
    while res != []:
        Res = Res + res
        xx = x.replace('(%s)' %Res[-1] , '%s')
        res = re.findall(ptn, xx)
        print(res)
        if res != []:
            res[0] = res[0] %('(%s)' %Res[-1])
    return Res

getContectWithinBraces(s , left='\(\[\{' , right = '\)\]\}')

Here is a demo for your question, though it is clumsy, while it works

import re s = '(((1+0)+1)+1)'

def getContectWithinBraces( x , *args , **kwargs):
    ptn = r'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]' %kwargs
    Res = []
    res = re.findall(ptn , x)
    while res != []:
        Res = Res + res
        xx = x.replace('(%s)' %Res[-1] , '%s')
        res = re.findall(ptn, xx)
        print(res)
        if res != []:
            res[0] = res[0] %('(%s)' %Res[-1])
    return Res

getContectWithinBraces(s , left='\(\[\{' , right = '\)\]\}')
挖个坑埋了你 2024-11-04 03:15:17

我的解决方案是:定义一个函数来提取最外面括号内的内容,然后重复调用该函数,直到获取最里面括号内的内容。

def get_string_inside_outermost_parentheses(text):
    content_p = re.compile(r"(?<=\().*(?=\))")
    r = content_p.search(text)
    return r.group() 

def get_string_inside_innermost_parentheses(text):
    while '(' in text:
        text = get_string_inside_outermost_parentheses(text)
    return text

my solution is that: define a function to extract content within the outermost parentheses, and then you call that function repeatedly until you get the content within the innermost parentheses.

def get_string_inside_outermost_parentheses(text):
    content_p = re.compile(r"(?<=\().*(?=\))")
    r = content_p.search(text)
    return r.group() 

def get_string_inside_innermost_parentheses(text):
    while '(' in text:
        text = get_string_inside_outermost_parentheses(text)
    return text
最后的乘客 2024-11-04 03:15:17

许多帖子建议对于嵌套大括号,
正则表达式不是这样做的方法。简单地数一下大括号:
例如,请参阅:用于检测半的正则表达式-冒号终止 C++ 的 & while 循环

这是一个完整的 Python 示例,用于迭代字符串并计算大括号的数量:

# decided for nested braces to not use regex but brace-counting
import re, string

texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.

 This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}

more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
    openBr = 0
    foundpos = string.find(texta, r'\note',pos)
    # print 'foundpos',foundpos
    pos = foundpos + 5
    # print texta[pos]
    result = ""
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == "{":
            openBr = openBr + 1
        if texta[pos] == "}":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = string.replace(result,'\n', ' ') # replace new line with space
    print result

Many posts suggest that for nested braces,
REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES:
For example, see: Regular expression to detect semi-colon terminated C++ for & while loops

Here is a complete python sample to iterate through a string and count braces:

# decided for nested braces to not use regex but brace-counting
import re, string

texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.

 This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}

more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
    openBr = 0
    foundpos = string.find(texta, r'\note',pos)
    # print 'foundpos',foundpos
    pos = foundpos + 5
    # print texta[pos]
    result = ""
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == "{":
            openBr = openBr + 1
        if texta[pos] == "}":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = string.replace(result,'\n', ' ') # replace new line with space
    print result
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文