Parsing a syntax tree with Perl regex
Perhaps regex is not the best way to parse this; tell me if it is not. Anyway, here are some examples of what the syntax trees look like:
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
What I am trying to do is pull out the connective (and, then, once, etc.) and its corresponding head (CC, IN, CC). I already know both for each syntax tree, so they can act as an anchor. I also need to retrieve the parent (S in the first example, SBARTMP in the second, S in the third) and the connective's siblings, if there are any (none in the first, one sibling in the second, and both a left-hand and a right-hand sibling in the third). Anything higher than the parent should not be included.
my $pos = "(\\\w|-)*";
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};
This code works for things like:
my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";
my $test2 = "(X (SBAR-TMP (IN once))";
my $test3 = "(X (SBAR-TMP (IN once) (X as))";
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";
It throws away the X on top and keeps everything else. However, once the siblings have further constituents embedded in them, it no longer matches, because the regex does not go deeper.
my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
I am not sure how to account for this. I am fairly new to Perl's extended patterns and have only just started learning them. To clarify what the regex is meant to do: it looks for the connective inside its own pair of parentheses together with its capital-letter/hyphen tag, then for a complete parent of the same format closed by its own parenthesis, and then for any number of siblings whose parentheses are all paired off.
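One likely reason the pattern stops at the first level of nesting, assuming the sibling sub-pattern is meant to recurse into itself: `(?1)` is an absolute reference, and once `$sibling` is interpolated into a larger `qr{}` that has its own capture group, group 1 is no longer the balanced-parentheses group. The sketch below reproduces that effect in isolation; `$balanced` and the sample strings are made up for illustration and are not the pattern from the question.

```perl
use strict;
use warnings;
use 5.010;

# Balanced-parentheses pattern that recurses by absolute group number.
my $balanced = qr{(\((?:[^()]++|(?1))*+\))};

# On its own, group 1 is the balanced group, so nesting is handled.
my $alone = "(a (b))" =~ m{\A$balanced\z} ? "match" : "no match";

# Placed after another capture group, (?1) now refers to (x) instead,
# so the nested "(b)" can no longer be consumed.
my $embedded = "x(a (b))" =~ m{\A(x)$balanced\z} ? "match" : "no match";

say "alone: $alone";        # alone: match
say "embedded: $embedded";  # embedded: no match
```

Flat siblings such as `(S sdf)` never need the recursive branch, which would explain why the simpler test strings still match.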
Comments (3)
To get only the nearest 'parent' of your anchor connective, you can do it as a recursive parent with a FAIL, or do it directly.
(For some reason I can't edit my other posts; my cookies must be getting deleted.)
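A minimal sketch of the "directly" variant, not the commenter's original code: the balanced constituent is defined once in a `(?(DEFINE)...)` block and referenced by name, so the recursion does not depend on group numbers. The helper name `nearest_parent` is hypothetical, and the sketch assumes every sibling of the anchor is itself a parenthesised constituent.

```perl
use strict;
use warnings;
use 5.010;    # named recursion and possessive quantifiers need Perl 5.10+

# Hypothetical helper: return the nearest parent constituent of the anchor
# "(HEAD connective)" together with all of its siblings, or undef.
sub nearest_parent {
    my ($tree, $head, $connective) = @_;
    my $re = qr{
        (                                       # group 1: the whole parent
            \( [\w-]+ \s*                       # parent label such as SBAR-TMP
            (?: (?&node) \s* )*                 # left siblings, if any
            \( \Q$head\E \s+ \Q$connective\E \) # the anchor itself
            \s*
            (?: (?&node) \s* )*                 # right siblings, if any
            \)
        )
        (?(DEFINE)
            # one balanced constituent, nested to any depth
            (?<node> \( (?: [^()]++ | (?&node) )*+ \) )
        )
    }x;
    return $tree =~ $re ? $1 : undef;
}

say nearest_parent('(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))', 'IN', 'once');
# prints (SBAR-TMP (IN once) (MORE stuff (MORE stuff)))
```

Because the match has to start at the opening parenthesis of a node whose direct child is the anchor, an outer wrapper such as `(X ...)` cannot match and is skipped.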
Why did you give up on this? You almost had it. Try this:
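The code that accompanied this comment is not preserved here. As a hedged sketch of a small change in the same spirit, the sibling pattern can use the relative recursion `(?-1)`, which refers to the nearest capture group opened before it and therefore still points at the sibling group after interpolation. The single backslashes and the `[\w-]*` label class are assumptions about what the question's pattern intended.

```perl
use strict;
use warnings;
use 5.010;

# (?-1) recurses into the closest capture group opened before it, which is
# the enclosing group here, so the recursion survives interpolation.
my $sibling    = qr{\s*(\((?:[^()]++|(?-1))*+\))\s*};
my $connective = "once";
# [\w-]* lets the parent label contain a hyphen, as in SBAR-TMP.
my $re = qr{(\([\w-]*\s*$sibling*\s*\(IN\s\Q$connective\E\)\s*$sibling*\s*\))};

my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
say $1 if $test =~ $re;
# prints (SBAR-TMP (IN once) (MORE stuff (MORE stuff)))
```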
This should work as well