How would you go about implementing the off-side rule?

Posted on 2024-07-07 15:25:09

I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.

In short: in this context, the off-side rule means that indentation is recognized as a syntactic element.

Here is the off-side rule in pseudocode, for building tokenizers that capture indentation in a usable form. I don't want to limit answers to any one language:

token NEWLINE
    matches r"\n\ *"
    increase line count
    pick up and store the indentation level
    remember to also record the current level of parenthesis

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    per each token
        if it is NEWLINE put it to last_newline and get next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if newline was inside parentheses, do nothing
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, indentation is broken.
                    emit last_newline as DEDENT token
                clear last_newline
        emit token
    while level.top != 0
        emit DEDENT token
        pop from level

Comments are discarded before they reach the layouter.
The layouter sits between the lexer and the parser.

This layouter never generates more than one NEWLINE at a time, and it doesn't generate a NEWLINE when an INDENT is coming up, so the parsing rules remain quite simple. I think it's pretty good, but let me know if there's a better way of accomplishing it.
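
For readers who prefer something runnable, the pseudocode above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the `Token` type, the toy `lex` function and its token kinds (`WORD`, `LPAREN`, `RPAREN`) are invented here and are not part of the original question. Only `layout` is meant to mirror the layouter procedure.

```python
import re
from typing import Iterable, Iterator, NamedTuple

class Token(NamedTuple):
    kind: str        # "NEWLINE", "INDENT", "DEDENT", "WORD", "LPAREN", "RPAREN"
    text: str = ""
    indent: int = 0  # indentation after the newline (NEWLINE-derived tokens only)
    depth: int = 0   # parenthesis depth at the newline

# Toy lexer (an assumption): a NEWLINE swallows the spaces that follow it.
TOKEN_RE = re.compile(
    r"(?P<NEWLINE>\n *)|(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<WORD>[^\s()]+)|(?P<SKIP> +)"
)

def lex(source: str) -> Iterator[Token]:
    depth = 0
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "LPAREN":
            depth += 1
        elif kind == "RPAREN":
            depth -= 1
        if kind == "NEWLINE":
            yield Token("NEWLINE", "\n", indent=len(m.group()) - 1, depth=depth)
        else:
            yield Token(kind, m.group())

def layout(tokens: Iterable[Token]) -> Iterator[Token]:
    levels = [0]     # stack of indentation levels
    pending = None   # last NEWLINE seen, not yet emitted
    for tok in tokens:
        if tok.kind == "NEWLINE":
            pending = tok            # later NEWLINEs overwrite earlier ones
            continue
        if pending is not None:
            if pending.depth > 0:
                pass                 # newline inside parentheses: do nothing
            elif pending.indent > levels[-1]:
                levels.append(pending.indent)
                yield pending._replace(kind="INDENT")
            elif pending.indent == levels[-1]:
                yield pending
            else:
                while pending.indent < levels[-1]:
                    levels.pop()
                    if pending.indent > levels[-1]:
                        raise ValueError("indentation is broken")
                    yield pending._replace(kind="DEDENT")
            pending = None
        yield tok
    while levels[-1] != 0:           # close blocks still open at end of input
        levels.pop()
        yield Token("DEDENT")

kinds = [t.kind for t in layout(lex("a\n    b (c\n d)\n    e\nf\n"))]
print(kinds)
```

As in the pseudocode, a NEWLINE that is followed only by end of input is never emitted, and a newline inside parentheses leaves no trace in the output stream.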

After using this for a while, I've noticed that it may be nice to emit a NEWLINE after DEDENTs anyway. That way you can separate expressions with NEWLINE while keeping INDENT ... DEDENT as a trailer for an expression.
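
One possible reading of that last idea, sketched in Python as a small post-pass over token kinds. The function name and the use of plain strings as tokens are assumptions made for illustration; the original does not say how the extra NEWLINE would be produced.

```python
from typing import Iterable, Iterator

def newline_after_dedents(kinds: Iterable[str]) -> Iterator[str]:
    """Yield the stream with one NEWLINE added after each run of DEDENTs."""
    in_run = False
    for kind in kinds:
        if in_run and kind != "DEDENT":
            yield "NEWLINE"          # the run of DEDENTs just ended
        in_run = kind == "DEDENT"
        yield kind
    if in_run:
        yield "NEWLINE"              # input ended on a DEDENT

stream = ["WORD", "INDENT", "WORD", "DEDENT", "DEDENT", "WORD"]
result = list(newline_after_dedents(stream))
print(result)
```

With this in place a grammar could treat every statement as an expression, an optional INDENT ... DEDENT trailer, and then a NEWLINE.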

猫卆 2024-07-14 15:25:09

I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.

Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three-step process. The first layer combined what your lexer and layouter do, minus all the NEWLINE lookahead stuff (which made it very simple). The second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in this case all indented blocks were always preceded by colons). The parser was then the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described, especially if you want to separate the lexer from the layouter, which presumably you would if you were using a code-generation tool to make your lexer, as is common practice.
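
The second, folding layer described above might look something like this in Python. It is a guess at the shape of that layer, not the actual code: the function name `fold` and the representation of tokens as plain kind strings are assumptions.

```python
from typing import Iterable, List

def fold(kinds: Iterable[str]) -> List[str]:
    """Collapse NEWLINE runs and turn [COLON] NEWLINE INDENT into INDENT."""
    out: List[str] = []
    for kind in kinds:
        if kind == "NEWLINE" and out and out[-1] == "NEWLINE":
            continue                 # fold consecutive NEWLINEs into one
        if kind == "INDENT":
            if out and out[-1] == "NEWLINE":
                out.pop()            # NEWLINE INDENT -> INDENT
            if out and out[-1] == "COLON":
                out.pop()            # COLON NEWLINE INDENT -> INDENT
        out.append(kind)
    return out

raw = ["NAME", "COLON", "NEWLINE", "NEWLINE", "INDENT",
       "NAME", "NEWLINE", "DEDENT", "NAME"]
folded = fold(raw)
print(folded)
```

Because the pass only ever looks at the token it emitted last, the parser behind it needs no lookahead past a NEWLINE.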

I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:

this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces

which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.
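
A rough Python sketch of that scheme, where each NEWLINE carries the indentation change and the whitespace it swallowed. The `Tok` type, the `TEXT` kind and `lex_with_deltas` are names invented for this illustration. It also assumes space-only indentation and an unindented first line, and it does not store trailing whitespace on non-NEWLINE tokens.

```python
import re
from typing import List, NamedTuple

class Tok(NamedTuple):
    kind: str       # "TEXT" or "NEWLINE"
    text: str       # line content, or the whitespace a NEWLINE swallowed
    delta: int = 0  # NEWLINE only: indentation change for the next logical line

# One NEWLINE spans the EOL, any blank lines, and the next line's indentation.
NEWLINE_RE = re.compile(r"\n(?:[ ]*\n)*[ ]*")

def lex_with_deltas(source: str) -> List[Tok]:
    tokens: List[Tok] = []
    pos, indent = 0, 0
    for m in NEWLINE_RE.finditer(source):
        if m.start() > pos:
            tokens.append(Tok("TEXT", source[pos:m.start()]))
        ws = m.group()
        new_indent = len(ws) - ws.rfind("\n") - 1   # spaces after the last EOL
        tokens.append(Tok("NEWLINE", ws, new_indent - indent))
        indent, pos = new_indent, m.end()
    if pos < len(source):
        tokens.append(Tok("TEXT", source[pos:]))
    return tokens

source = (
    "this line introduces an indented block of literal text:\n"
    "    this line of the block is indented four spaces\n"
    "  but this line is only indented two spaces\n"
)
tokens = lex_with_deltas(source)
deltas = [t.delta for t in tokens if t.kind == "NEWLINE"]
print(deltas)
```

Since every character of the input ends up in some token, the parser can still reconstruct literal text exactly, and it decides for itself whether a given delta opens a block, closes one, or belongs to a literal.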

Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....

ゞ记忆︶ㄣ 2024-07-14 15:25:09

I've been experimenting with this recently, and I came to the conclusion that, for my needs at least, I wanted the NEWLINEs to mark the end of each "statement", whether or not it was the last statement in an indented block, i.e. I need the NEWLINEs even before a DEDENT.

My solution was to turn it on its head: instead of NEWLINEs marking the end of lines, I use a LINE token to mark the start of a line.

I have a lexer that collapses empty lines (including comment-only lines) and emits a single LINE token with information about the indentation of the last line. Then my preprocessing function takes this token stream and adds INDENT or DEDENT "in between" any lines where the indentation changes. So

line1
    line2
    line3
line4

would give the token stream

LINE "line1" INDENT LINE "line2" LINE "line3" DEDENT LINE "line4" EOF

This allows me to write clear grammar productions for statements without worrying about detecting the end of statements, even when they end with nested, indented subblocks, something that can be hard if you are matching NEWLINEs (and DEDENTs) instead.

Here is the core of the preprocessor, written in OCaml:

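  (* Return the next output token. When a LINE changes the indentation,
     INDENT/DEDENT is returned and the LINE stays queued, so the next call
     looks at the same LINE again. *)
  match next_token () with
      LINE indentation ->
        if indentation > !current_indentation then
          (
            (* Deeper: remember the enclosing level and open a block. *)
            Stack.push !current_indentation indentation_stack;
            current_indentation := indentation;
            INDENT
          )
        else if indentation < !current_indentation then
          (
            (* Shallower: close one block per call. *)
            let prev = Stack.pop indentation_stack in
              if indentation > prev then
                (
                  (* Landed between two known levels. *)
                  current_indentation := indentation;
                  BAD_DEDENT
                )
              else
                (
                  current_indentation := prev;
                  DEDENT
                )
          )
        else (* indentation = !current_indentation *)
          let token = remove_next_token () in
            (* Drop a LINE that is immediately followed by EOF. *)
            if next_token () = EOF then
              remove_next_token ()
            else
              token
    | _ ->
        remove_next_token ()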

I haven't added support for parentheses yet, but that should be a simple extension. It does, however, avoid emitting a stray LINE at the end of the file.
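
For comparison, here is a self-contained Python sketch of the same LINE-based idea. It is not a port of the OCaml above, and it differs in several ways that are my own assumptions: it works directly on source text, keeps each line's content as a single string, only collapses blank lines (comment handling is left out), raises an error where the original returns BAD_DEDENT, and closes any blocks still open at end of input.

```python
from typing import List

def line_tokens(source: str) -> List[str]:
    levels = [0]                     # indentation stack, current level on top
    out: List[str] = []
    for raw in source.splitlines():
        content = raw.strip()
        if not content:
            continue                 # collapse blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        if indent > levels[-1]:
            levels.append(indent)
            out.append("INDENT")
        while indent < levels[-1]:
            levels.pop()
            if indent > levels[-1]:
                raise ValueError("dedent does not match any outer level")
            out.append("DEDENT")
        out.extend(["LINE", content])
    while len(levels) > 1:           # assumption: close open blocks at EOF
        levels.pop()
        out.append("DEDENT")
    out.append("EOF")
    return out

stream = line_tokens("line1\n    line2\n    line3\nline4\n")
print(stream)
```

Every statement then starts with LINE, and a block is simply INDENT, one or more statements, DEDENT, with no end-of-statement token to look for.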

緦唸λ蓇 2024-07-14 15:25:09

A tokenizer in Ruby, just for fun:

def tokenize(input)
  result, prev_indent, curr_indent, line = [""], 0, 0, ""
  line_started = false

  input.each_char do |char|

    case char
    when ' '
      if line_started
        # Content already started, add it.
        line << char
      else
        # No content yet, just count.
        curr_indent += 1
      end
    when "\n"
      result.last << line + "\n"
      curr_indent, line = 0, ""
      line_started = false
    else
      # Check if we are at the first non-space character.
      unless line_started
        # Insert indent and dedent tokens if indentation changed.
        if prev_indent > curr_indent
          # 2 spaces dedentation
          ((prev_indent - curr_indent) / 2).times do
            result << :DEDENT
          end
          result << ""
        elsif prev_indent < curr_indent
          result << :INDENT
          result << ""
        end

        prev_indent = curr_indent
      end

      # Mark line as started and add char to line.
      line_started = true; line << char
    end

  end

  result
end

This only works for two-space indentation. The result is something like ["Hello there from level 0\n", :INDENT, "This\nis level\ntwo\n", :DEDENT, "This is level0 again\n"].
