Parsing Markdown block quotes with ANTLR

Posted 2024-08-18 01:23:52

This has been bothering me for a while. How does one go about parsing the following text into the HTML below using ANTLR? I can't seem to wrap my head around this at all.

Any ideas?

Markdown:

> first line
> second line
> > nested quote

Output HTML:

<blockquote>
  <p>first line
  second line</p>
  <blockquote>
    <p>nested quote</p>
  </blockquote>
</blockquote>

Comments (1)

2024-08-25 01:23:52

Funny you should mention that, because I was tackling exactly this problem last week. See "JMD, Markdown and a Brief Overview of Parsing and Compilers". I'm working on a true Markdown parser and I tried it with ANTLR.

There are a couple of ways you can deal with this.

First, you could just lex the quote marker as a token:

BLOCK_QUOTE : '>' (' ' | '\t')? ;

and work out the nesting in the parsing step, possibly as a rewrite rule.

The thing is, these markers only matter when they appear at the beginning of a line, so here is another approach:

@members {
  int quoteDepth = 0;
}

BLOCK_QUOTE : '\n' (q+='>' (' ' | '\t')?)+
  { int depth = $q.size();  // number of '>' markers on this line
    if (depth > quoteDepth) { /* emit one or more START_QUOTE tokens */ }
    else if (depth < quoteDepth) { /* emit one or more END_QUOTE tokens */ }
    quoteDepth = depth; }
  ;

The above may need to be a parser rule rather than a lexer rule. I forget.
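To make the depth bookkeeping concrete, here is a minimal sketch in plain Java, outside ANTLR. The class `QuoteDepthTracker` and its methods are hypothetical names for illustration, not ANTLR API, and it assumes the input has already been split into lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: tracks block-quote depth line by line.
class QuoteDepthTracker {
    private int quoteDepth = 0;

    // Counts leading '>' markers, each optionally followed by one space or tab.
    static int depthOf(String line) {
        int depth = 0;
        int i = 0;
        while (i < line.length() && line.charAt(i) == '>') {
            depth++;
            i++;
            if (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
                i++;
            }
        }
        return depth;
    }

    // Returns the START_QUOTE / END_QUOTE tokens implied by moving to this line's depth.
    List<String> tokensFor(String line) {
        int depth = depthOf(line);
        List<String> tokens = new ArrayList<>();
        for (int d = quoteDepth; d < depth; d++) {
            tokens.add("START_QUOTE");
        }
        for (int d = quoteDepth; d > depth; d--) {
            tokens.add("END_QUOTE");
        }
        quoteDepth = depth;
        return tokens;
    }
}
```

Feeding it the three lines from the question one at a time should give one START_QUOTE for the first line, nothing for the second, and one more START_QUOTE for the nested line. A trailing empty line then closes both levels.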

But even this is unsatisfying, because it forces you to treat the Markdown source as a sequence of lines, which isn't really what you want in other parts of the grammar.

Also, each lexer rule can normally produce only one token, so to emit several you have to override part of the lexer (exactly which class or method escapes me). There is an example of this in the excellent, and almost required, The Definitive ANTLR Reference: Building Domain-Specific Languages.
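The usual shape of that workaround is a queue of pending tokens that the lexer drains before it matches anything new. Here is a rough sketch in plain Java. `QueueingLexer`, `emit` and `nextToken` are a hypothetical stand-in written for this illustration, not the generated ANTLR lexer, and tokens are plain strings to keep it short.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a generated lexer; names are illustrative, not ANTLR API.
class QueueingLexer {
    private final Deque<String> pending = new ArrayDeque<>();
    private final String[] lines;
    private int next = 0;
    private int quoteDepth = 0;

    QueueingLexer(String source) {
        this.lines = source.split("\n", -1);
    }

    // A rule may call emit() several times; tokens wait in the queue.
    private void emit(String token) {
        pending.add(token);
    }

    // One "rule": consumes a whole line and may emit several tokens for it.
    private void lexLine(String line) {
        int depth = 0;
        int i = 0;
        while (i < line.length() && line.charAt(i) == '>') {
            depth++;
            i++;
            if (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
                i++;
            }
        }
        for (; quoteDepth < depth; quoteDepth++) {
            emit("START_QUOTE");
        }
        for (; quoteDepth > depth; quoteDepth--) {
            emit("END_QUOTE");
        }
        emit("TEXT(" + line.substring(i) + ")");
    }

    // Drains queued tokens before lexing more input; closes open quotes at the end.
    String nextToken() {
        while (pending.isEmpty()) {
            if (next < lines.length) {
                lexLine(lines[next++]);
            } else if (quoteDepth > 0) {
                quoteDepth--;
                emit("END_QUOTE");
            } else {
                return "EOF";
            }
        }
        return pending.poll();
    }
}
```

The parser side then only ever calls `nextToken()` and never needs to know that one line produced several tokens.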

Ultimately I abandoned ANTLR as the tool of choice for this. My own hand-coded solution should hopefully be appearing in the next week or two.
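For comparison, a hand-coded approach to just the block-quote part can be quite small. The following is only a sketch under simplifying assumptions, not the solution mentioned above: `BlockQuoteRenderer` is a hypothetical name, it handles only `>` markers and plain paragraphs, it does no HTML escaping, and it ignores lazy continuation lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: recursive rendering of nested block quotes to HTML.
class BlockQuoteRenderer {

    // Renders a list of lines at the given indentation.
    static String render(List<String> lines, String indent) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < lines.size()) {
            String line = lines.get(i);
            if (line.startsWith(">")) {
                // Collect the run of quoted lines, strip one marker, and recurse.
                List<String> inner = new ArrayList<>();
                while (i < lines.size() && lines.get(i).startsWith(">")) {
                    inner.add(stripMarker(lines.get(i)));
                    i++;
                }
                out.append(indent).append("<blockquote>\n")
                   .append(render(inner, indent + "  "))
                   .append(indent).append("</blockquote>\n");
            } else if (line.trim().isEmpty()) {
                i++; // blank lines only separate blocks
            } else {
                // Collect a paragraph: consecutive non-blank, non-quoted lines.
                List<String> para = new ArrayList<>();
                while (i < lines.size()
                        && !lines.get(i).trim().isEmpty()
                        && !lines.get(i).startsWith(">")) {
                    para.add(lines.get(i));
                    i++;
                }
                out.append(indent).append("<p>")
                   .append(String.join("\n" + indent, para))
                   .append("</p>\n");
            }
        }
        return out.toString();
    }

    // Removes one leading '>' and at most one following space or tab.
    static String stripMarker(String line) {
        String rest = line.substring(1);
        if (rest.startsWith(" ") || rest.startsWith("\t")) {
            rest = rest.substring(1);
        }
        return rest;
    }

    static String toHtml(String markdown) {
        return render(Arrays.asList(markdown.split("\n")), "");
    }
}
```

Calling `BlockQuoteRenderer.toHtml("> first line\n> second line\n> > nested quote")` should reproduce the HTML from the question. The recursion does the work that the START_QUOTE and END_QUOTE tokens do in the lexer approach: each level strips one marker and hands the rest down.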
