How do I tokenize this string in Ruby?

Posted 2024-07-17 01:08:21

I have this string:

%{Children^10 Health "sanitation management"^5}

And I want to tokenize it into an array of hashes:

[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]

I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.

Any pointers?

櫻之舞 2024-07-24 01:08:21

For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:

irb> text = %{Children^10 Health "sanitation management"^5}
irb> text.scan(/(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/).map do |word,phrase,boost|
       { :keywords => (word || phrase).downcase, :boost => (boost.nil? ? nil : boost.to_i) }
     end
#=> [{:boost=>10, :keywords=>"children"}, {:boost=>nil, :keywords=>"health"}, {:boost=>5, :keywords=>"sanitation management"}]

If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.

A quick breakdown of the regex:

\w+ matches any single-term keywords
(?:\\.|[^\\"])* uses non-capturing parentheses ((?:...)) to match the contents of an escaped double-quoted string - either an escaped symbol (\n, \", \\, etc.) or any single character that's not an escape symbol or an end quote.
"((?:\\.|[^\\"])*)" captures only the contents of a quoted keyword phrase.
(?:(\w+)|"((?:\\.|[^\\"])*)") matches any keyword - single term or phrase, capturing single terms into $1 and phrase contents into $2
\d+ matches a number.
\^(\d+) captures a number following a caret (^). Since this is the third set of capturing parentheses, it will be captured into $3.
(?:\^(\d+))? captures a number following a caret if it's there, matches the empty string otherwise.

String#scan(regex) matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parens, a "match" is an array of the items captured - so $1 becomes match[0], $2 becomes match[1], etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a nil entry in the resulting "match".
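
To see that behaviour in isolation, here is a tiny sketch on a made-up input (the string and the variable name pairs are only for illustration, they are not part of the question):

```ruby
# Each match becomes an array with one slot per capture group;
# a group that did not take part in the match shows up as nil.
pairs = "a=1 b c=3".scan(/(\w)(?:=(\d))?/)

pairs.each { |name, value| puts "#{name} => #{value.inspect}" }
```

The middle entry has nil in its second slot, because the optional "=digit" group had nothing to match after b.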

The #map then takes these matches, uses some block magic to break each captured term into different variables (we could have done do |match| ; word,phrase,boost = *match), and then creates your desired hashes. Exactly one of word or phrase will be nil, since both can't be matched against the input, so (word || phrase) will return the non-nil one, and #downcase will convert it to all lowercase. boost.to_i will convert a string to an integer while (boost.nil? ? nil : boost.to_i) will ensure that nil boosts stay nil.
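
Putting the pieces above together, the same regex can be written in extended (/x) mode so that each part carries its own comment. This is only a sketch: the constant TOKEN and the method name tokenize are names I made up, and boost && boost.to_i is just a shorter spelling of the nil check above.

```ruby
# Same pattern as in the answer, spread out with the x flag.
TOKEN = /
  (?:
    (\w+)                  # group 1: a single-term keyword
    |
    "((?:\\.|[^\\"])*)"    # group 2: contents of a double-quoted phrase
  )
  (?:\^(\d+))?             # group 3: optional boost after a caret
/x

# Hypothetical helper wrapping the scan and map steps.
def tokenize(text)
  text.scan(TOKEN).map do |word, phrase, boost|
    { :keywords => (word || phrase).downcase, :boost => boost && boost.to_i }
  end
end

p tokenize(%{Children^10 Health "sanitation management"^5})
```

Keep in mind that String#scan simply skips over text that doesn't match, so malformed input (a stray caret, an unclosed quote) is not reported as an error with this approach.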

嗼ふ静 2024-07-24 01:08:21

Here is a non-robust example using StringScanner. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.

require 'strscan'

def test_parse
  text = %{Children^10 Health "sanitation management"^5}
  expected = [{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]


  assert_equal(expected, parse(text))
end

def parse(text)
  @input = StringScanner.new(text)

  output = []

  while keyword = parse_string || parse_quoted_string
    output << {
      :keywords => keyword,
      :boost => parse_boost
    }
    trim_space
  end

  output
end

def parse_string
  if @input.scan(/\w+/)
    @input.matched.downcase
  else
    nil
  end
end

def parse_quoted_string
  if @input.scan(/"/)
    str = parse_quoted_contents
    @input.scan(/"/) or raise "unclosed string"
    str
  else
    nil
  end
end

def parse_quoted_contents
  # Downcase here too, so quoted phrases are normalized like bare keywords.
  @input.scan(/[^\\"]+/) and @input.matched.downcase
end

def parse_boost
  if @input.scan(/\^/)
    boost = @input.scan(/\d+/)
    raise 'missing boost value' if boost.nil?
    boost.to_i
  else
    nil
  end
end

def trim_space
  @input.scan(/\s+/)
end
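
Since the code above is explicitly non-robust, here is one possible way to tighten it, as a rough sketch rather than a finished parser. The class name BoostQueryParser is made up, and the choice to raise on any leftover input (checked with StringScanner#eos?) is my assumption about what "robust" should mean here; like the original, it does not handle backslash escapes inside quotes.

```ruby
require 'strscan'

# Hypothetical wrapper around the same StringScanner calls, which raises
# on input it cannot parse instead of silently stopping.
class BoostQueryParser
  def initialize(text)
    @input = StringScanner.new(text)
  end

  def parse
    output = []
    @input.skip(/\s+/)
    until @input.eos?
      keyword = parse_keyword or raise "unexpected input at position #{@input.pos}"
      output << { :keywords => keyword.downcase, :boost => parse_boost }
      @input.skip(/\s+/)
    end
    output
  end

  private

  # Returns a bare word or the contents of a quoted phrase, or nil.
  def parse_keyword
    return @input.matched if @input.scan(/\w+/)
    return nil unless @input.scan(/"/)
    str = @input.scan(/[^\\"]*/)
    @input.scan(/"/) or raise "unclosed string"
    str
  end

  # Returns the integer after a caret, or nil when there is no caret.
  def parse_boost
    return nil unless @input.scan(/\^/)
    @input.scan(/\d+/) or raise "missing boost value"
    @input.matched.to_i
  end
end

p BoostQueryParser.new(%{Children^10 Health "sanitation management"^5}).parse
```

Keeping the scanner in an instance of its own class also avoids sharing @input between unrelated calls, which the top-level methods above would do.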

作死小能手 2024-07-24 01:08:21

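What you have here is an arbitrary grammar, and to parse it what you really need is a lexer - you can write a grammar file that describes your syntax and then use the lexer to generate a recursive parser from your grammar.

Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html

LR parsers:
http://raa.ruby-lang.org/project/racc/ (runtime extension in Ruby 1.8)
http://raa.ruby-lang.org/project/rockit/ (doesn't work on 1.8?)

LL parsers:
http://raa.ruby-lang.org/project/coco-rb/
http://rubyforge.org/projects/tdp4r/
http://rubyforge.org/projects/coco-ruby/
http://rubyforge.org/projects/grammar/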
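
For a sense of what the lexer/parser split looks like without any of the generator tools listed above, here is a hand-written sketch for the string in the question. Everything in it is an assumption for illustration: the method names lex and parse_tokens, and the token names :WORD, :PHRASE and :CARET, do not come from any of those libraries.

```ruby
require 'strscan'

# Lexer: only splits the input into [type, value] tokens.
def lex(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif scanner.scan(/\w+/)
      tokens << [:WORD, scanner.matched]
    elsif scanner.scan(/"((?:\\.|[^\\"])*)"/)
      tokens << [:PHRASE, scanner[1]]
    elsif scanner.scan(/\^/)
      tokens << [:CARET, scanner.matched]
    else
      raise "unexpected character #{scanner.peek(1).inspect} at #{scanner.pos}"
    end
  end
  tokens
end

# Parser: groups tokens into keyword hashes with an optional boost.
def parse_tokens(tokens)
  tokens = tokens.dup
  output = []
  until tokens.empty?
    type, value = tokens.shift
    raise "expected a keyword, got #{type}" unless [:WORD, :PHRASE].include?(type)
    boost = nil
    if tokens.first && tokens.first[0] == :CARET
      tokens.shift
      boost_type, digits = tokens.shift
      raise "missing boost value" unless boost_type == :WORD && digits =~ /\A\d+\z/
      boost = digits.to_i
    end
    output << { :keywords => value.downcase, :boost => boost }
  end
  output
end

p parse_tokens(lex(%{Children^10 Health "sanitation management"^5}))
```

The point of the split is that the lexer knows nothing about boosts belonging to keywords; that rule lives only in parse_tokens, which is the part a parser generator would produce from a grammar file.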