Perl: parsing mis-formatted bracketed text

Posted on 2024-12-17 00:59:59 · 1819 characters · 1 view · 0 comments

I have a string of text chunked into phrases, with each phrase surrounded by square brackets:

[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]

Sometimes a chunk does not start with a p-character (like the last one above).

My problem is I need to capture each chunk. That's okay under normal circumstances, but sometimes this input is mis-formatted, for example, some chunks might have only one bracket, or none. So it might look like this:

 [pX textX/labelX] pY textY/labelY] textZ/labelZ

But it ought to come out like this:

 [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]

The problem does not include nested brackets. After diving into loads of different people's regex solutions like never before (I'm new at regex), and downloading cheat-sheets and getting a Regex tool (Expresso) I still don't know how to do this. Any ideas? Maybe regex doesn't work. But how is this problem solved? I imagine it's not a very unique problem.

Edit

Here is a specific example:

$data= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";

This is a great compact solution from @FailedDev:

while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}

but I think two points need to be added for emphasis in the problem:

  1. some chunks have no brackets at all
  2. ,/PUNC and w#hm/CC_PRP_MP3] are separate chunks that need to be separated.

However, since this case is a fixed one (i.e. a punctuation mark followed by a text/label pattern that has only one square bracket, on the right), I more or less hard-coded it into the solution like this:

my @stuff;
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    if($& =~ m/(^[\S]\/PUNC )(.*\])/) # match a "./PUNC" mark followed by a "phrase]"
    {
        @bits = split(/ /,$&); # split by space
        push(@stuff, $bits[0]); # just grab the first chunk before space, a PUNC
        push(@stuff, substr($&, 7)); # after that space is the other chunk
    }
    else { push(@stuff, $&); }
}
foreach(@stuff){ print $_; }

Trying the example I added in the edit, this works just fine except for one problem. The last ./PUNC gets left out, so the output is:

[VP sysmH/VBD_MS3]
[PP ll#/IN_DET Axryn/NNS_MP]
,/PUNC
w#hm/CC_PRP_MP3]
[NP AEDA'/NN]
,/PUNC
[PP b#/IN m/NN_FS]
[NP >HyAnA/NN]

How can I keep the last chunk?

Comments (3)

离线来电— 2024-12-24 00:59:59

You could use this

/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*)/

Assuming your string is something like :

[pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]

It will not work with input such as: pY [[[textY/labelY]

Perl specific solution :

while ($subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}

Update :

/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/

This works with your updated string, but you should trim the whitespace of the results, if you need to.
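
As a rough sketch of that trimming step, the updated pattern can be wrapped in a small helper that strips the whitespace its last alternative picks up. The name extract_chunks is made up for illustration and is not part of the answer; the sample string escapes the dollar sign so that "$Arkp" is not interpolated as a variable.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): collect every match of
# the updated pattern and trim the surrounding whitespace from each one.
sub extract_chunks {
    my ($text) = @_;
    my @chunks;
    while ($text =~ m/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/g) {
        my $chunk = $&;                 # copy the match before running another regex
        $chunk =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
        push @chunks, $chunk if length $chunk;
    }
    return @chunks;
}

# "\$" keeps the literal text "$Arkp" inside the double-quoted string.
my $data = "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m\$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";

print "$_\n" for extract_chunks($data);
```

On the sample string this should print nine chunks, one per line, with ,/PUNC and w#hm/CC_PRP_MP3] as separate entries and ./PUNC as the last one.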

Update : 2

/(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/

I suggest opening a different question, because your original question is totally different than the last one.

"
(                 # Match the regular expression below and capture its match into backreference number 1
                     # Match either the regular expression below (attempting the next alternative only if this one fails)
      \[                # Match the character “[” literally
      [^[]              # Match any character that is NOT a “[”
         *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
      ]                 # Match the character “]” literally
   |                 # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      [^[ ]             # Match a single character NOT present in the list “[ ”
      .                 # Match any single character that is not a line break character
         *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
      ]                 # Match the character “]” literally
   |                 # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
      \[                # Match the character “[” literally
      [^[ ]             # Match a single character NOT present in the list “[ ”
         *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   |                 # Or match regular expression number 4 below (the entire group fails if this one fails to match)
      \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
         *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      [^[]              # Match any character that is NOT a “[”
         +?                # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
      (?:               # Match the regular expression below
                           # Match either the regular expression below (attempting the next alternative only if this one fails)
            \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
               +                 # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         |                 # Or match regular expression number 2 below (the entire group fails if this one fails to match)
            $                 # Assert position at the end of the string (or before the line break at the end of the string, if any)
      )
)
"
混吃等死 2024-12-24 00:59:59
s{
   \[?
   (?: ([^\]/\s]+) \s+ )?
   ([^\]/\s]+)
   /
   ([^\]/\s]+)
   \]?
}{
   '[' .
   ( defined($1) ? "$1 " : '' ) .
   $2 .
   '/' .
   $3 .
   ']'
}xeg;
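
A minimal usage sketch for the substitution above, assuming the first capture group is meant to use the same character class as the other two (anything except "]", "/" and whitespace). The wrapper name fix_brackets is hypothetical; the bare s{}{} in the answer works on $_ instead.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name not from the thread): rewrite every
# "tag text/label" or "text/label" group so it has both brackets.
sub fix_brackets {
    my ($text) = @_;
    $text =~ s{
        \[?                        # optional opening bracket
        (?: ([^\]/\s]+) \s+ )?     # optional phrase tag such as pX
        ([^\]/\s]+)                # text
        /
        ([^\]/\s]+)                # label
        \]?                        # optional closing bracket
    }{
        '[' . ( defined($1) ? "$1 " : '' ) . $2 . '/' . $3 . ']'
    }xeg;
    return $text;
}

print fix_brackets('[pX textX/labelX] pY textY/labelY] textZ/labelZ'), "\n";
```

For the simple example from the question this should print [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]. As far as I can tell, a chunk that holds two text/label pairs, such as [PP ll#/IN_DET Axryn/NNS_MP] from the edited example, would be split into two bracketed chunks, so this approach fits the original format better than the updated one.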
安稳善良 2024-12-24 00:59:59

This is essentially the same procedure which I applied to your previous problem, I just changed the map a bit:

#!/usr/bin/perl

use strict;
use warnings;

my $string= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m\$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";

my @items = split(/(\[.+?\])/, $string);

my @new_items = map { 
                     if (/^\[.+\]$/) { # items in []
                        $_;
                     } 
                     elsif (/\s/) {
                        grep m/\w/, split(/\s+/); # use grep to eliminate the split results that are the empty string
                     }
                     else { # discard empty strings
                     }
                    } @items;

print "--$_--\n" for @new_items;

The output you get is this (the hyphens are only there to illustrate the absence of leading/trailing blanks):

--[VP sysmH/VBD_MS3]--
--[PP ll#/IN_DET Axryn/NNS_MP]--
--,/PUNC--
--w#hm/CC_PRP_MP3]--
--[NP AEDA'/NN]--
--,/PUNC--
--[PP b#/IN m$Arkp/NN_FS]--
--[NP >HyAnA/NN]--
--./PUNC--

I think this is the result you wanted to obtain. I don't know whether you will be satisfied with a non-'regex only' solution though...
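
If the repaired form from the original question is also wanted, where every chunk carries both brackets, each item could be passed through a small helper afterwards. This is only a sketch under that assumption; add_brackets is a made-up name and the short list stands in for @new_items.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): make sure a chunk
# starts with "[" and ends with "]", adding whichever one is missing.
sub add_brackets {
    my ($chunk) = @_;
    $chunk =~ s/^\[?/[/;     # replace an optional leading "[" with exactly one
    $chunk =~ s/\]?\z/]/;    # replace an optional trailing "]" with exactly one
    return $chunk;
}

# A few items of the kind produced above, standing in for @new_items.
my @sample = ( q{[NP AEDA'/NN]}, q{,/PUNC}, q{w#hm/CC_PRP_MP3]}, q{./PUNC} );

print add_brackets($_), "\n" for @sample;
```

Chunks that already have both brackets pass through unchanged, while ,/PUNC and w#hm/CC_PRP_MP3] should come out as [,/PUNC] and [w#hm/CC_PRP_MP3].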
