Remove all nested blocks while leaving non-nested blocks alone, in Python

Posted on 2024-08-16 07:16:46

Source:

[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]

Resultant text:

[This] is some text with

I don't think you can do this with a regex, judging from the threads on Stack Overflow.

Is there a simple way to do this, or must one reach for pyparsing (or another parsing library)?

Comments (4)

阿楠 2024-08-23 07:16:46

Here's an easy way that doesn't require any dependencies: scan the text and keep a counter for the brackets that you pass over. Increment the counter each time you see a "["; decrement it each time you see a "]".

  • As long as the counter is at zero or one, put the text you see onto the output string.
  • Otherwise, you are in a nested block, so don't put the text onto the output string.
  • If the counter doesn't finish at zero, the string is malformed; you have an unequal number of opening and closing brackets. (If it's greater than zero, you have that many excess [s; if it's less than zero, you have that many excess ]s.)
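
For illustration, here is a minimal sketch of the counter approach above, read literally. The name strip_nested and the choice to raise ValueError on unbalanced input are assumptions made for this sketch, not part of the answer. Note that, read this way, the rule only drops text at depth 2 or deeper, so the depth-1 shell of a block that contains nesting survives: the question's example comes out as "[This] is some text with [some ]" rather than the result asked for.

```python
def strip_nested(text):
    """Keep only the characters that sit at bracket depth 0 or 1."""
    out = []
    depth = 0
    for ch in text:
        if ch == '[':
            depth += 1          # the '[' belongs to the block it opens
        if depth <= 1:
            out.append(ch)      # counter is zero or one: keep the text
        if ch == ']':
            depth -= 1          # the ']' belongs to the block it closes
    if depth != 0:
        # > 0: that many excess '['; < 0: that many excess ']'
        raise ValueError("unbalanced brackets: counter ended at %d" % depth)
    return ''.join(out)


src = "[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]"
print(strip_nested(src))
```

Dropping the whole outer block as well needs a little more state, because the decision about "[some" cannot be made until its nested "[" has been seen.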

浅忆 2024-08-23 07:16:46

Taking the OP's example as normative (any block including further nested blocks must be removed), what about...:

import itertools

x = '''[This] is some text with [some [blocks that are nested [in a [variety]
of ways]]] and some [which are not], and [any [with nesting] must go] away.'''

def nonest(txt):
  pieces = []
  d = 0
  level = []
  # level[i] is the nesting depth of txt[i]; brackets count as inside their block
  for c in txt:
    if c == '[': d += 1
    level.append(d)
    if c == ']': d -= 1
  # split into alternating runs of depth 0 and depth > 0
  for k, g in itertools.groupby(zip(txt, level), lambda x: x[1]>0):
    block = list(g)
    # drop any run that reaches depth 2 or more
    if max(d for c, d in block) > 1: continue
    pieces.append(''.join(c for c, d in block))
  print(''.join(pieces))

nonest(x)

This emits

[This] is some text with  and some [which are not], and  away.

which under the normative hypothesis would seem to be the desired result.

The idea is to compute, in level, a parallel list of counts recording "how nested are we at this point" (i.e., how many brackets have been opened and not yet closed so far); then use groupby to segment the zip of the text with level into alternating blocks with zero nesting and nesting > 0. For each block, the maximum nesting within it is then computed (it stays at zero for blocks with zero nesting; more generally, it is just the maximum of the nesting levels throughout the block), and if that maximum is <= 1, the corresponding block of text is preserved. Note that we need to turn the group g into a list, block, because we want to make two passes over it (one to get the max nesting, one to rejoin the characters into a block of text). To do it in a single pass we'd need to keep some auxiliary state in the nested loop, which is a bit less convenient in this case.
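
As a rough illustration of the single-pass alternative mentioned above, the auxiliary state can be a buffer for the bracketed block currently being read plus the deepest nesting seen inside it. This is only a sketch: the name nonest_single_pass, the variable names, and the handling of unbalanced input (a stray "]" is kept as plain text, an unterminated block is kept only if it never nested) are assumptions of this example, not part of the answer.

```python
def nonest_single_pass(txt):
    pieces = []      # text that will be kept
    block = []       # characters of the top-level block being read
    depth = 0        # current nesting depth
    max_depth = 0    # deepest nesting seen inside the current block
    for c in txt:
        if c == '[':
            depth += 1
            max_depth = max(max_depth, depth)
        if depth > 0:
            block.append(c)       # inside a block: decide later
        else:
            pieces.append(c)      # plain text outside any block
        if c == ']' and depth > 0:
            depth -= 1
            if depth == 0:
                # the top-level block just closed: keep it only if it never nested
                if max_depth <= 1:
                    pieces.append(''.join(block))
                block = []
                max_depth = 0
    if block and max_depth <= 1:
        pieces.append(''.join(block))   # unterminated, non-nested block
    return ''.join(pieces)


x = '''[This] is some text with [some [blocks that are nested [in a [variety]
of ways]]] and some [which are not], and [any [with nesting] must go] away.'''
print(nonest_single_pass(x))
```

One behavioural difference worth knowing: this sketch judges each top-level block on its own, whereas the groupby version treats blocks that touch, such as "[a][b [c]]", as a single run and drops both.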

橘和柠 2024-08-23 07:16:46

You will be better off writing a parser, especially if you use a parsing library like pyparsing. It will be more maintainable and extensible.

In fact, pyparsing already implements the parser for you; you just need to write the function that filters the parser output.

少年亿悲伤 2024-08-23 07:16:46

I took a couple of passes at writing a single parser expression that could be used with expression.transformString(), but I had difficulty distinguishing between nested and unnested []'s at parse time. In the end I had to open up the loop in transformString and iterate over the scanString generator explicitly.

To address the question of whether [some] should be included or not based on the original question, I explored this by adding more "unnested" text at the end, using this string:

src = """[This] is some text with [some [blocks that are 
    nested [in a [variety] of ways]] in various places]"""

My first parser follows the original question's lead, and rejects any bracketed expression that has any nesting. My second pass takes the top level tokens of any bracketed expression, and returns them in brackets - I didn't like this solution so well, as we lose the information that "some" and "in various places" are not contiguous. So I took one last pass, and had to make a slight change to the default behavior of nestedExpr. See the code below:

from pyparsing import nestedExpr, ParseResults, CharsNotIn

# 1. scan the source string for nested [] exprs, and take only those that
# do not themselves contain [] exprs
out = []
last = 0
for tokens,start,end in nestedExpr("[","]").scanString(src):
    out.append(src[last:start])
    if not any(isinstance(tok,ParseResults) for tok in tokens[0]):
        out.append(src[start:end])
    last = end
out.append(src[last:])
print("".join(out))


# 2. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each
out = []
last = 0
for t,s,e in nestedExpr("[","]").scanString(src):
    out.append(src[last:s])
    topLevel = [tok for tok in t[0] if not isinstance(tok,ParseResults)]
    out.append('['+" ".join(topLevel)+']')
    last = e
out.append(src[last:])
print("".join(out))


# 3. scan the source string for nested [] exprs, and take only the toplevel 
# tokens from each, keeping each group separate
out = []
last = 0
for t,s,e in nestedExpr("[","]", CharsNotIn('[]')).scanString(src):
    out.append(src[last:s])
    for tok in t[0]:
        if isinstance(tok,ParseResults): continue
        out.append('['+tok.strip()+']')
    last = e
out.append(src[last:])
print("".join(out))

Giving:

[This] is some text with 
[This] is some text with [some in various places]
[This] is some text with [some][in various places]

I hope one of these comes close to the OP's question. But if nothing else, I got to explore nestedExpr's behavior a little further.
