How do I tokenize this string in Ruby?
I have this string:
%{Children^10 Health "sanitation management"^5}
And I want to tokenize it into an array of hashes:
[{:keywords=>"children", :boost=>10}, {:keywords=>"health", :boost=>nil}, {:keywords=>"sanitation management", :boost=>5}]
I'm aware of StringScanner and the Syntax gem, but I can't find enough code examples for either.
Any pointers?
For a real language, a lexer's the way to go - like Guss said. But if the full language is only as complicated as your example, you can use this quick hack:
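A minimal sketch of what such a hack could look like, assembled from the regex pieces and the `scan`/`map` steps this answer describes. The variable names `text`, `token_re` and `tokens` are assumptions made for illustration, not necessarily the names used in the answer's own snippet:

```ruby
text = %{Children^10 Health "sanitation management"^5}

# Group 1: a bare word. Group 2: the contents of a quoted phrase.
# Group 3: the digits after an optional caret (the boost).
token_re = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

tokens = text.scan(token_re).map do |word, phrase, boost|
  { :keywords => (word || phrase).downcase,
    :boost    => (boost.nil? ? nil : boost.to_i) }
end

# tokens now holds one hash per keyword, in input order.
p tokens
```

For the sample string this yields three hashes equal to the array asked for in the question.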
If you're trying to parse a regular language then this method will suffice - though it wouldn't take many more complications to make the language non-regular.
A quick breakdown of the regex:
- `\w+` matches any single-term keyword.
- `(?:\\.|[^\\"])*` uses non-capturing parentheses (`(?:...)`) to match the contents of an escaped double-quoted string: either an escaped symbol (`\n`, `\"`, `\\`, etc.) or any single character that's not an escape symbol or an end quote.
- `"((?:\\.|[^\\"])*)"` captures only the contents of a quoted keyword phrase.
- `(?:(\w+)|"((?:\\.|[^\\"])*)")` matches any keyword, single term or phrase, capturing single terms into `$1` and phrase contents into `$2`.
- `\d+` matches a number.
- `\^(\d+)` captures a number following a caret (`^`). Since this is the third set of capturing parentheses, it will be captured into `$3`.
- `(?:\^(\d+))?` captures a number following a caret if it's there, and matches the empty string otherwise.
`String#scan(regex)` matches the regex against the string as many times as possible, outputting an array of "matches". If the regex contains capturing parentheses, a "match" is an array of the items captured, so `$1` becomes `match[0]`, `$2` becomes `match[1]`, etc. Any capturing parenthesis that doesn't get matched against part of the string maps to a `nil` entry in the resulting "match".
The `#map` then takes these matches, uses some block magic to break each captured term into different variables (we could have done `do |match| ; word,phrase,boost = *match`), and then creates your desired hashes. Exactly one of `word` or `phrase` will be `nil`, since both can't be matched against the input, so `(word || phrase)` will return the non-`nil` one, and `#downcase` will convert it to all lowercase. `boost.to_i` will convert a string to an integer, while `(boost.nil? ? nil : boost.to_i)` will ensure that `nil` boosts stay `nil` (a bare `nil.to_i` would give 0).
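To make the capture-to-array mapping concrete, here is a small sketch of what `scan` returns before `#map` touches it, and why the boost is guarded instead of converted blindly. The names `pattern`, `raw`, `naive` and `guarded` are illustrative assumptions:

```ruby
text = %{Children^10 Health "sanitation management"^5}
pattern = /(?:(\w+)|"((?:\\.|[^\\"])*)")(?:\^(\d+))?/

# Each match is a three-element array: [$1, $2, $3].
# Groups that did not take part in the match show up as nil.
raw = text.scan(pattern)
# raw == [["Children", nil, "10"], ["Health", nil, nil],
#         [nil, "sanitation management", "5"]]

# nil.to_i is 0, so a blind conversion turns "no boost" into a boost of 0.
naive   = raw.map { |_word, _phrase, boost| boost.to_i }
guarded = raw.map { |_word, _phrase, boost| boost.nil? ? nil : boost.to_i }
# naive   == [10, 0, 5]
# guarded == [10, nil, 5]
```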
Here is a non-robust example using StringScanner. This is code I just adapted from Ruby Quiz: Parsing JSON, which has an excellent explanation.
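As a rough illustration of the StringScanner approach, and not the code adapted from the Ruby Quiz, a minimal sketch might look like the following. The helper name `tokenize` and its error handling are assumptions; only `StringScanner` itself comes from Ruby's standard `strscan` library:

```ruby
require 'strscan'

# Hypothetical helper: walks the input left to right, reading one keyword
# (quoted phrase or bare word) and an optional ^boost at a time.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    scanner.skip(/\s+/)          # ignore whitespace between tokens
    break if scanner.eos?
    if scanner.scan(/"((?:\\.|[^\\"])*)"/)
      keywords = scanner[1]      # contents of the quoted phrase
    elsif scanner.scan(/\w+/)
      keywords = scanner.matched # the bare word itself
    else
      raise ArgumentError, "unexpected input at position #{scanner.pos}"
    end
    boost = scanner.scan(/\^(\d+)/) ? scanner[1].to_i : nil
    tokens << { :keywords => keywords.downcase, :boost => boost }
  end
  tokens
end

result = tokenize(%{Children^10 Health "sanitation management"^5})
```

Compared with a single `String#scan` call, this version stops with an error on input it does not understand instead of silently skipping it, at the cost of more code.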
What you have here is an arbitrary grammar, and to parse it what you really want is a lexer and a parser - you can write a grammar file that describes your syntax and then use a parser generator to generate a parser from your grammar.
Writing a lexer (or even a recursive parser) is not really trivial - although it is a useful exercise in programming - but you can find a list of Ruby lexers/parsers in this email message here: http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.ruby/2005-11/msg02233.html
RACC is available as a standard module of Ruby 1.8, so I suggest you concentrate on that even if its manual is not really easy to follow and it requires familiarity with yacc.