Parsing a syntax tree with Perl regex

Posted on 2024-10-09 02:12:17

Perhaps regex is not the best way to parse this; tell me if it is not. Anyway, here are some examples of what the syntax trees look like:

(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))   

What I am trying to do is pull out the connective (and, then, once, etc.) and its corresponding head (CC, IN, CC), which I already know for each syntax tree, so it can act as an anchor. I also need to retrieve its parent (S in the first example, SBARTMP in the second, S in the third) and its siblings, if there are any (none in the first, a right-hand sibling in the second, and both a left-hand and a right-hand sibling in the third). Anything higher than the parent should not be included.

my $pos = "(\\w|-)*";
my $sibling = qr{\s*(\((?:(?>[^()]+)|(?1))*\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\(IN\s$connective\)\s*$sibling*\s*\))};

This code works for things like:

my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";  
my $test2 = "(X (SBAR-TMP (IN once))";  
my $test3 = "(X (SBAR-TMP (IN once) (X as))";  
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";  

It throws away the X on top and keeps everything else. However, once the siblings have anything embedded in them, it no longer matches, because the regex does not go deeper.

my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";  

I am not sure how to account for this. I am fairly new to Perl's extended patterns and have only just started learning them. To clarify what the regex is doing: it looks for the connective inside its own pair of parentheses together with its capital-letter/hyphen label, looks for a complete parent of the same format closed by its own parenthesis, and should then match any number of siblings whose parentheses are all paired off.

Comments (3)

浅笑轻吟梦一曲 2024-10-16 02:12:17

To get only the nearest 'parent' of your anchor connective, you can either
write it as a recursive parent with a (*FAIL), or match it directly.
(For some reason I can't edit my other posts; my cookies must be getting deleted.)

use strict;
use warnings;

my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
  \s*
  ( 
     (?! $connective )
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

my $regex1 = qr/
      \( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
 /x;

my $regex2 = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my $sample = qq/
 (X (SBAR-TMP (IN once) (S sdf) (S sdf)))
 (X (SBAR-TMP (IN once))
 (X (SBAR-TMP (IN once) (X as))
 (X (SBAR-TMP (X adsf) (IN once))
 (X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
 (S (CC and))  
 (SBARTMP (IN once) (NP otherstuff))   
 (S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;

while ($sample =~ /$regex1/xg) {
    print "Found:   $1\n";
}
print '-' x 20, "\n";

while ($sample =~ /$regex2/xg) {
    print "Found:   $2\n";
}

__END__
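
One likely reason the pattern in the question stops at nested siblings: `(?1)` is an absolute reference, so once `$sibling` is interpolated into a larger regex it recurses into whatever group is number 1 there, not into the sibling group itself. The relative form `(?-1)` used above keeps pointing at the group that encloses it. A minimal sketch of the difference, assuming Perl 5.10 or later; the names `$abs`, `$rel`, `$with_abs` and `$with_rel` are made up for this illustration:

```perl
use strict;
use warnings;

# Balanced-parentheses group using absolute recursion into group 1.
my $abs = qr/ ( \( (?: [^()]++ | (?1) )* \) ) /x;

# The same group using relative recursion into the enclosing group.
my $rel = qr/ ( \( (?: [^()]++ | (?-1) )* \) ) /x;

# Embed each one after another capture group, so that the
# balanced group becomes group 2 of the combined pattern.
my $with_abs = qr/ ^ (\w+) \s+ $abs $ /x;
my $with_rel = qr/ ^ (\w+) \s+ $rel $ /x;

my $input = 'head (MORE stuff (MORE stuff))';

# In $with_abs, (?1) now recurses into (\w+), which cannot consume "(".
print "absolute: ", ($input =~ $with_abs ? "match" : "no match"), "\n";

# In $with_rel, (?-1) still recurses into the balanced group itself.
print "relative: ", ($input =~ $with_rel ? "match, sibling = $2" : "no match"), "\n";
```

With the nested input, only the relative version should match, and `$2` holds the whole nested sibling.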
温柔嚣张 2024-10-16 02:12:17

Why did you give up on this? You almost had it. Try this:

use strict;
use warnings;

 my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
 my $sibling = qr/
  \s*
  ( 
     (?!$connective)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s*
 /x;

 my $regex = qr/
   ( #1
     \(
        \s* [\w-]+ \s*
        (?>   $sibling* \s* $connective \s* $sibling*
            | (?1)
        )
      \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
         print "Found:   $1\n";
    }
}

my $another =<<EOS;
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S
  (S
    (NP blah
      (VP blah)
    )
    (CC then)
    (NP blah
      (VP blah
        (PP blah)
      )
    )
  )
)
EOS

print "\n---------\n";
    while ($another =~ /$regex/xg) {
         print "\nFound:\n$1\n";
    }

__END__
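
A side note on the test data: three of the four one-line test strings never close the top-level X, so an unanchored regex can only ever match the inner parent in them. A quick check, counting parentheses with `tr`; the helper name `paren_surplus` is made up for this sketch:

```perl
use strict;
use warnings;

my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
);

# Return the number of '(' minus the number of ')' in a string.
# tr with an empty replacement list only counts, it does not modify.
sub paren_surplus {
    my ($s) = @_;
    my $open  = ($s =~ tr/(//);
    my $close = ($s =~ tr/)//);
    return $open - $close;
}

# A surplus of 0 means the string is balanced.
printf "%-45s unclosed '(' = %d\n", $_, paren_surplus($_) for @tests;
```

Only the first string comes out balanced; each of the others has one unclosed parenthesis.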

面如桃花 2024-10-16 02:12:17

This should work as well:

use strict;
use warnings;

my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
  (?: \s*
  ( 
     (?!$connective)
     \(
        (?:
            (?> (?: [^()]+ ) )
          | (?-1)
        )*
     \)
  )
  \s* )
 /x;

my $regex = qr/
   ( #1
     \( \s*
        (  #2
           [\w-]+ \s*
           (?>   $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
               | (?1)
           )
        )
        \s*
     \)
   )
 /x;


my @tests = (
  '(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',  
  '(X (SBAR-TMP (IN once))',
  '(X (SBAR-TMP (IN once) (X as))',
  '(X (SBAR-TMP (X adsf) (IN once))',
  '(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',    
);

for my $sample (@tests)
{
    while ($sample =~ /$regex/xg) {
        print "Found:   $2\n";
    }
}

my $another = "
(S (CC and))  
(SBARTMP (IN once) (NP otherstuff))   
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";

print "\n---------\n";
while ($another =~ /$regex/xg) {
    print "\nFound:\n$2\n";
}

__END__
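
On the question of whether regex is the best tool here: an alternative is to parse the brackets into a nested structure first and then look up the connective. The following is only a sketch under assumptions, not something from the answers above: it assumes one balanced tree per string, and the helper names `parse_tree`, `find_connective` and `to_string` are made up for illustration.

```perl
use strict;
use warnings;

# Parse "(S (CC and))" into nested array refs: ['S', ['CC', 'and']].
sub parse_tree {
    my ($text) = @_;
    my @tokens = $text =~ /[()]|[^\s()]+/g;
    my (@stack, $root);
    for my $tok (@tokens) {
        if ($tok eq '(') {
            push @stack, [];
        }
        elsif ($tok eq ')') {
            die "unbalanced ')'\n" unless @stack;
            my $node = pop @stack;
            if (@stack) { push @{ $stack[-1] }, $node }
            else        { $root = $node }
        }
        else {
            die "token outside parentheses: $tok\n" unless @stack;
            push @{ $stack[-1] }, $tok;
        }
    }
    die "unbalanced '('\n" if @stack;
    return $root;
}

# Find the first node of the form (HEAD word) whose word is the connective.
# Returns its parent label, its head and its left and right siblings.
sub find_connective {
    my ($node, $connective) = @_;
    my ($label, @children) = @$node;
    for my $i (0 .. $#children) {
        my $child = $children[$i];
        next unless ref $child;
        if (@$child == 2 && !ref $child->[1] && $child->[1] eq $connective) {
            return {
                parent => $label,
                head   => $child->[0],
                left   => [ @children[0 .. $i - 1] ],
                right  => [ @children[$i + 1 .. $#children] ],
            };
        }
    }
    # Not among the direct children, so descend into each subtree.
    for my $child (grep { ref } @children) {
        my $hit = find_connective($child, $connective);
        return $hit if $hit;
    }
    return;
}

# Turn a parsed node back into bracketed text.
sub to_string {
    my ($node) = @_;
    return $node unless ref $node;
    return '(' . join(' ', map { to_string($_) } @$node) . ')';
}

my $tree = parse_tree('(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))');
my $hit  = find_connective($tree, 'once');
printf "parent=%s head=%s left=[%s] right=[%s]\n",
    $hit->{parent}, $hit->{head},
    join(' ', map { to_string($_) } @{ $hit->{left} }),
    join(' ', map { to_string($_) } @{ $hit->{right} });
```

Because the siblings are kept as parsed subtrees, nesting depth is not a concern here. The unbalanced test strings from the question would be rejected by `parse_tree` rather than partially matched, which may or may not be what you want.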