Remove all nested blocks while leaving non-nested blocks alone, in Python
Source:
[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]
Resultant text:
[This] is some text with
I don't think a regex can do this, judging from the threads on Stack Overflow.
Is there a simple way to do this, or must one reach for pyparsing (or another parsing library)?
Here's an easy way that doesn't require any dependencies: scan the text and keep a counter for the brackets that you pass over. Increment the counter each time you see a "["; decrement it each time you see a "]". (The final counter value also tells you about unbalanced brackets: if it's greater than zero you have that many excess "["s; if it's less than zero you have that many excess "]"s.)
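The counter idea can be sketched roughly as follows. This is only one reading of it, not the answerer's code: it assumes that a top-level block containing any nested block should be dropped entirely (as in the question's example), so it buffers each top-level block until its closing bracket and only then decides whether to keep it. The function name strip_nested is made up for illustration.

```python
def strip_nested(text):
    """Drop every top-level [...] block that contains another [...] block."""
    out = []        # characters that are kept
    buf = []        # characters of the top-level block being scanned
    depth = 0       # brackets opened and not yet closed
    max_depth = 0   # deepest nesting seen inside the current block
    for ch in text:
        if ch == "[":
            depth += 1
            max_depth = max(max_depth, depth)
        if depth > 0:
            buf.append(ch)      # inside a block: hold the text back
        else:
            out.append(ch)      # outside any block (a stray "]" lands here too)
        if ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                # The top-level block just closed: keep it only if it never nested.
                if max_depth == 1:
                    out.extend(buf)
                buf = []
                max_depth = 0
    # Assumption: an unclosed block at the end of the input is simply dropped.
    return "".join(out)
```

On the question's source string this keeps "[This] is some text with " (including the trailing space) and drops the rest.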
Taking the OP's example as normative (any block that includes further nested blocks must be removed), what about...:
This emits
which under the normative hypothesis would seem to be the desired result.
The idea is to compute, in level, a parallel list of counts of "how nested are we at this point" (i.e., how many opened and not yet closed brackets we have met so far); then segment the zip of level with the text, with groupby, into alternating blocks with zero nesting and nesting > 0. For each block, the maximum nesting within it is then computed (it stays at zero for blocks with zero nesting; more generally, it's just the maximum of the nesting levels throughout the block), and if the resulting nesting is <= 1, the corresponding block of text is preserved. Note that we need to make the group g into a list block, as we want to perform two iteration passes (one to get the max nesting, one to rejoin the characters into a block of text). To do it in a single pass we'd need to keep some auxiliary state in the nested loop, which is a bit less convenient in this case.
You will be better off writing a parser, especially if you use a parser generator like pyparsing. It will be more maintainable and extensible.
In fact, pyparsing already implements the parser for you; you just need to write the function that filters the parser output.
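To make the "parse first, then filter the parser output" split concrete without pulling in a dependency, here is a small hand-written stand-in. It is not pyparsing's API. The parse function plays the role the answer assigns to pyparsing by turning the text into a tree of strings and lists, and keep_flat is the kind of filter function you would still have to write yourself. Both names are invented for this sketch.

```python
def parse(text):
    """Parse text into a tree: str for plain runs, list for a [...] block."""
    root = []
    stack = [root]          # stack[-1] is the block currently being filled
    for ch in text:
        if ch == "[":
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif ch == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            stack.pop()
        else:
            node = stack[-1]
            if node and isinstance(node[-1], str):
                node[-1] += ch   # extend the current run of plain text
            else:
                node.append(ch)
    if len(stack) != 1:
        raise ValueError("unbalanced '['")
    return root

def keep_flat(tree):
    """Filter step: keep plain text and blocks that contain no sub-block."""
    out = []
    for node in tree:
        if isinstance(node, str):
            out.append(node)
        elif all(isinstance(item, str) for item in node):
            out.append("[" + "".join(node) + "]")
    return "".join(out)
```

Keeping the two steps separate is what makes this easier to extend: a different policy (for example, keeping only the top-level text of nested blocks) would mean swapping keep_flat and leaving parse untouched.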
I took a couple of passes at writing a single parser expression that could be used with expression.transformString(), but I had difficulty distinguishing between nested and unnested []'s at parse time. In the end I had to open up the loop in transformString and iterate over the scanString generator explicitly.
To address the question of whether [some] should be included or not based on the original question, I explored this by adding more "unnested" text at the end, using this string:
My first parser follows the original question's lead and rejects any bracketed expression that has any nesting. My second pass takes the top-level tokens of any bracketed expression and returns them in brackets. I didn't like this solution as much, since we lose the information that "some" and "in various places" are not contiguous. So I took one last pass, and had to make a slight change to the default behavior of nestedExpr. See the code below:
Giving:
I hope one of these comes close to the OP's question. But if nothing else, I got to explore nestedExpr's behavior a little further.