Parsing a string with a grammar, without lookahead?

Posted 2025-01-18 05:03:29

Got this text:

Want this || Not this

The line may also look like this:

Want this | Not this

with a single pipe.

I'm using this grammar to parse it:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? <?before <divider>> }
       token divider { <[|]> ** 1..2 } 
       token post { \N* }
    }

Is there a better way to do this? I'd love to be able to do something more like this:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

But this does not work. And if I do this:

    grammar HC {
       token TOP {  <pre>* <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

Each character before divider gets its own <pre> capture. Thanks.

來不及說愛妳 2025-01-25 05:03:29

As always, TIMTOWTDI.

I'd love to be able to do something more like this

You can. Just switch the first two rule declarations from token to regex:

    grammar HC {
       regex TOP {  <pre> <divider> <post> }
       regex pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

This works because regex disables :ratchet (unlike token and rule which enable it).

(Explaining why you need to switch it off for both rules is beyond my paygrade, certainly for tonight, and quite possibly till someone else explains why to me so I can pretend I knew all along.)
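For anyone who wants to try the regex version on both sample lines, here is a minimal sketch. The loop and the output format are my own additions for illustration; the grammar is the one above.

```raku
grammar HC {
    regex TOP     { <pre> <divider> <post> }   # regex: TOP may backtrack into <pre>
    regex pre     { \N*? }                     # frugal: starts empty, grows on backtracking
    token divider { <[|]> ** 1..2 }            # one or two pipes
    token post    { \N* }                      # the rest of the line
}

for 'Want this || Not this', 'Want this | Not this' -> $line {
    my $m = HC.parse($line);
    say $m ?? "pre='$m<pre>' divider='$m<divider>' post='$m<post>'"
           !! "no match: $line";
}
```

Note that pre still includes the trailing space before the divider, since nothing in the grammar consumes it separately.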

if I do this ... each character gets its own <pre> capture

By default, "calling a named regex installs a named capture with the same name" [... couple sentences later:] "If no capture is desired, a leading dot or ampersand will suppress it". So change <pre> to <.pre>.

Next, you can manually add a named capture by wrapping a pattern in $<name>=[pattern]. So to capture the whole string matched by consecutive calls of the pre rule, wrap the non-capturing pattern (<.pre>*?) in $<pre>=[...]:

    grammar HC {
       token TOP { $<pre>=[<.pre>*?] <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

旧时模样 2025-01-25 05:03:29

OK - I tried use Grammar::Tracer; (our best friend!) and got this from your original and the first answer with regexes ... both seemed wrong to me...

TOP
|  pre
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * MATCH "|"
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
｢Want this | Not this｣
 pre => ｢Want this ｣
 divider => ｢|｣
 post => ｢ Not this｣

This gives me the feeling that your combination of pre and divider is not converging. So I altered the code to this (with a more definite definition of pre)...

    use Grammar::Tracer;

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre {  <-[|]>* }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

and got this...

TOP
|  pre
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
｢Want this | Not this｣
 pre => ｢Want this ｣
 divider => ｢|｣
 post => ｢ Not this｣

Sooo - I conclude that (i) using Grammar::Tracer to inspect the operation of Grammars is a must do, (ii) a loose definition like the original, which requires the parser to test at every character boundary, should be avoided, (iii) especially if the divider is hard to pin down.

I have the wider feeling that a Grammar (parser) may not be well suited to the underlying raw data structure and that a set of regexes may be a better approach.
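As a rough sketch of what the plain-regex route might look like, assuming the line only needs to be cut at the first divider: Str.split accepts a regex and a limit. The helper name split-on-divider is made up for this example.

```raku
# Hypothetical helper; the separator regex mirrors the divider token.
sub split-on-divider(Str $line) {
    # \h* on both sides swallows the padding around the pipes;
    # the limit of 2 keeps any later pipes inside the second field.
    $line.split(/ \h* '|' ** 1..2 \h* /, 2)
}

my ($want, $rest) = split-on-divider('Want this || Not this');
say $want;
say $rest;
```

A line with no pipe at all comes back as a single field, so the second variable would be left undefined in that case.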

I failed to work out how to use <.ws> or equivalent to trim the empty spaces from the captured results.
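One way around this, sketched here without claiming it is the best one, is to leave the grammar alone and trim the captures after parsing. The helper split-line is hypothetical; .trim is the ordinary string method.

```raku
grammar HC {
    token TOP     { <pre> <divider> <post> }
    token pre     { <-[|]>* }          # everything up to the first pipe
    token divider { <[|]> ** 1..2 }    # one or two pipes
    token post    { \N* }              # the rest of the line
}

# Hypothetical helper, not part of the original posts.
sub split-line(Str $line) {
    my $m = HC.parse($line) or return Nil;
    # Stringify each capture, then strip leading and trailing whitespace.
    return $m<pre>.Str.trim, $m<post>.Str.trim;
}

my ($want, $rest) = split-line('Want this || Not this');
say $want;
say $rest;
```

This keeps the match tree itself unchanged (pre still ends in a space there); only the values handed back to the caller are trimmed.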
