Parser in Ruby: dealing with sticky comments and quotes

Posted on 2024-09-12 11:39:44

I am trying to write a recursive-descent parser in Ruby for a grammar defined by the following rules:

Input consists of white-space separated Cards, each starting with a Stop-word, where white-space is the regex /[ \n\t]+/.
A Card may consist of Keywords and/or Values, also separated by white-space, in a card-specific order/pattern.
All Stop-words and Keywords are case-insensitive, i.e. they match /^[a-z]+[a-z0-9]*$/i.
A Value can be a double-quoted string, which is not necessarily separated from other words by white-space, e.g.:
```
word"quoted string"word
```
A Value can also be a word /^[a-z]+[a-z0-9]*$/, an integer, or a float (e.g. -1.15 or 1.0e+2).
A single-line comment starts with # and is not necessarily separated from other words, e.g.:
```
word#single-line comment\n
```
A multi-line comment is delimited by /* and */ and is not necessarily separated from other words, e.g.:
```
word/*multi-line 
comment*/word
```

```
# Input example. Stop-words are chosen just to highlight them: set, object
set title"Input example"set objects 2#not-separated by white-space. test: "/*
set test "#/*"
object 1 shape box/* shape is a Keyword, 
box is a Value. test: "#*/object 2 shape sphere
set data # message and complete are Values
0 0 0 0 1 18 18 18 1 35 35 35 72 35 35 # all numbers are Values of the Card "set"
```

Since most words are separated by white-space, for a while I considered splitting the whole input and parsing it word by word. To deal with comments and quotes, I was going to do

```
words = input_text.gsub( /([\"\#\n]|\/\*|\*\/)/, ' \1 ' ).split( /[ \t]+/ )
```

However, this modifies the content of strings (and of comments, if I want to keep them). How would you deal with these sticky comments and quotes?
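To make the problem concrete, here is a small sketch of what the padding-and-splitting approach does to a quoted string. The sample input is made up for illustration (it is not the full example above); the regular expressions are the ones from the snippet.

```ruby
# Illustrative input: note the two spaces inside the quoted string.
input_text = 'set title"Input  example"set objects 2#sticky'

# Pad every quote, '#', newline and comment delimiter with spaces, then split.
words = input_text.gsub(/([\"\#\n]|\/\*|\*\/)/, ' \1 ').split(/[ \t]+/)
p words

# The quoted string is now spread over several words, and joining them back
# cannot restore the original spacing between "Input" and "example".
p words[3..4].join(' ')
```

The quotes and the `#` become separate words, and the white-space inside the string is lost, which is why splitting alone is not enough here.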


故事灯 2024-09-19 11:39:44

OK, I made it myself. The following code could be condensed if readability is not a concern:

```
# Splits the input into "words" (plain words, quoted strings, comments),
# recording each word's character range and its starting line and column.
class WordParser
  attr_reader :words

  def initialize(text)
    @text = text
  end

  def parse
    reset_parser
    until eof?
      case curr_char
      when '"' # quoted string: read up to and including the closing quote
        start_word
        add_chars_until '"'
        close_word
      when '#', '%' # single-line comment: read up to and including the newline
        start_word
        add_chars_until "\n"
        close_word
      when '/'
        if next_is? '*' # multi-line comment: read up to and including '*/'
          start_word
          2.times { add_char }
          add_char until (curr_is?('*') && next_is?('/')) || eof?
          2.times { add_char } unless eof?
          close_word
        else
          # parser_error "unexpected symbol '/'" # if not allowed in the grammar
          start_word unless word_already_started?
          add_char
        end
      when /\S/ # any other non-white-space character belongs to a plain word
        start_word unless word_already_started?
        add_char
      else # skip white-space between words
        move
        close_word
      end
    end
    @words
  end

  private

  def reset_parser
    @position = 0
    @line = 1
    @column = 1
    @words = []
    @word_started = false
  end

  # Raises with the current line and column filled into the message.
  def parser_error(message)
    raise format('Parser error on line %d, col %d: %s', @line, @column, message)
  end

  def word_already_started?
    @word_started
  end

  def close_word
    @word_started = false
  end

  # Consumes characters up to and including the next occurrence of ch.
  def add_chars_until(ch)
    add_char until next_is?(ch) || eof?
    2.times { add_char } unless eof?
  end

  def add_char
    @words.last[:to] = @position
    # @words.last[:length] += 1
    # @words.last << curr_char # if one just collects words
    move
  end

  def start_word
    @words.push from: @position, to: @position, line: @line, column: @column
    # @words.push '' unless @words.last.empty? # if one just collects words
    @word_started = true
  end

  # Advances one character and keeps the line/column counters in sync.
  def move
    @position += 1
    return if eof?

    if prev_is? "\n"
      @line += 1
      @column = 1
    else
      @column += 1
    end
  end

  def eof?
    @position >= @text.length
  end

  def prev_is?(ch) = prev_char == ch
  def curr_is?(ch) = curr_char == ch
  def next_is?(ch) = next_char == ch

  def prev_char = @text[@position - 1]
  def curr_char = @text[@position]
  def next_char = @text[@position + 1]
end
```

Testing it with the example from my question:

```
# text holds the input example from the question
words = WordParser.new(text).parse
p words.map { |w| text[w[:from]..w[:to]] }

# >> ["# Input example. Stop-words are chosen just to highlight them: set, object\n",
# >>  "set", "title", "\"Input example\"", "set", "objects", "2",
# >>  "#not-separated by white-space. test: \"/*\n", "set", "test", "\"#/*\"",
# >>  "object", "1", "shape", "box", "/* shape is a Keyword, \nbox is a Value. test: \"#*/",
# >>  "object", "2", "shape", "sphere", "set", "data", "# message and complete are Values\n",
# >>  "0", "0", "0", "0", "1", "18", "18", "18", "1", "35", "35", "35", "72",
# >>  "35", "35", "# all numbers are Values of the Card \"set\"\n"]
```

So now I can use something like this to parse the words further.
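As a rough idea of what that next step might look like, the sketch below labels each extracted word according to the lexical rules in the question. The `classify` method, the constant names and the kind symbols are hypothetical and not part of the code above; the number patterns are one possible reading of "integer, or float".

```ruby
# Hypothetical patterns derived from the rules in the question.
WORD    = /\A[a-z]+[a-z0-9]*\z/i
INTEGER = /\A[+-]?\d+\z/
FLOAT   = /\A[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:e[+-]?\d+)?\z/i

# Returns a [kind, value] pair for one word produced by the splitter.
def classify(word)
  case word
  when /\A"/            # quoted string: strip the surrounding quotes
    [:string, word.delete_prefix('"').delete_suffix('"')]
  when /\A#/, %r{\A/\*} # single-line or multi-line comment, kept verbatim
    [:comment, word]
  when INTEGER          # checked before FLOAT so that "18" stays an integer
    [:integer, word.to_i]
  when FLOAT
    [:float, word.to_f]
  when WORD             # stop-word, keyword or word value; the card decides which
    [:word, word]
  else
    [:unknown, word]
  end
end

sample = ['set', 'title', '"Input example"', 'objects', '2', '-1.15', "#note\n"]
sample.each { |w| p classify(w) }
```

A recursive-descent parser could then drop the `:comment` entries, compare `:word` values case-insensitively against the Stop-words to find where each Card starts, and apply the card-specific pattern to the rest.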
