Code Golf:正则表达式解析器
今天的 Code Golf 挑战的目标
是用尽可能少的字符创建一个正则表达式解析器。
语法
不,我不是要求您匹配 Perl 风格的正则表达式。毕竟,已经有一个非常可靠的翻译了! :-)
以下是您需要了解的有关此挑战的正则表达式语法的所有信息:
- 术语 定义为单个文字字符,或分组括号
()
内的正则表达式。 *
(星号)字符表示对前一个 TERM 的Kleene 星形操作。这意味着零个或多个前一项连接在一起。+
(加号)字符表示一种方便的快捷方式:a+
相当于aa*
,表示前一项的一个或多个。?
(问号)字符表示前一项中的零个或一个。|
(竖线)字符表示交替,这意味着任意一侧的正则表达式都可以在匹配中使用。- 所有其他字符均假定为原义字符。您可以假设所有其他字符都在
[0-9A-Za-z]
范围内(即所有英文字母数字)。
或者,换句话说:*
/+
/?
具有最高优先级,然后是串联,然后是交替。由于交替的优先级低于串联,因此在不带括号的正则表达式中使用它会导致它绑定到每一侧的完整正则表达式。另一方面,*
和 +
以及 ?
仅适用于前一个术语。
挑战
您的挑战是编写一个程序来编译或解释正则表达式(如上面所定义),然后针对它测试多个字符串。
我将把意见留给你。我的建议是,正则表达式应该首先出现,然后是任意数量的字符串来对其进行测试;但如果你想让它持续下去,那也没关系。如果您想将所有内容放入命令行参数或 stdin 中,或者命令行中的正则表达式和 stdin 中的字符串,或者其他任何内容,都可以。仅显示一两个用法示例。
输出应为 true
或 false
,每行一个,以反映正则表达式是否匹配。
注释:
- 我不需要这么说...但是不要在您的语言中使用任何正则表达式库!您需要自己编译或解释该模式。 (编辑:如果需要拆分或连接字符串,您可以使用正则表达式。您只是不能使用它直接解决问题,例如,将输入正则表达式转换为语言正则表达式并使用它。 )
- 正则表达式必须完全匹配此挑战的输入字符串。 (同样,如果您熟悉类似 Perl 的正则表达式,请假设所有匹配项都具有字符串开头和结尾锚定)
- 对于此挑战,所有特殊字符
()*+?|
预计不会按字面意思出现。如果输入中出现一个,则可以安全地假设没有模式可以与相关字符串匹配。 - 要测试的输入字符串应以区分大小写的方式进行评估。
示例
对于示例,我假设一切都是在命令行参数中完成的,首先是正则表达式。 (正如我上面所说,输入由您决定。)这里的 myregex
代表您对程序的调用。
> myregex easy easy Easy hard
true
false
false
> myregex ab*a aa abba abab b
true
true
false
false
> myregex 0*1|10 1 10 0110 00001
true
true
false
true
> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true
> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true
注意:抱歉,忘记创建社区维基! :-(
The goal
Today's Code Golf challenge is to create a regex parser in as few characters as possible.
The syntax
No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)
Here's all you need to know about regex syntax for this challenge:
- A term is defined as a single literal character, or a regular expression within grouping parentheses
()
. - The
*
(asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together. - The
+
(plus) character represents a convenient shortcut:a+
is equivalent toaa*
, meaning one or more of the previous term. - The
?
(question mark) character represents zero or one of the previous term. - The
|
(pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match. - All other characters are assumed to be literal. You may assume that all other characters are within
[0-9A-Za-z]
(i.e., all English alphanumerics).
Or, put another way: *
/+
/?
have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. *
and +
and ?
, on the other hand, would just apply to the immediately preceding term.
The challenge
Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.
I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.
Output should be true
or false
, one per line, to reflect whether or not the regex matches.
Notes:
- I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. (Edit: You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
- The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
- For this challenge, all of the special characters
()*+?|
are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question. - Input strings to test should be evaluated in a case-sensitive manner.
The examples
For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex
here represents your invocation of the program.
> myregex easy easy Easy hard
true
false
false
> myregex ab*a aa abba abab b
true
true
false
false
> myregex 0*1|10 1 10 0110 00001
true
true
false
true
> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true
> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true
NOTE: Sorry, forgot to make community wiki! :-(
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
C(解释),
630622617615582576< /罢工> <罢工> 573 <罢工> 572 <罢工> 558 <罢工> 554 <罢工> 544 <罢工> 538529504502500499 个字符这可以理解括号,并且适用于正则表达式(即不首先解析) )
使用 -Wall 进行编译会抱怨 argc 为 char。遇到非法模式会崩溃。
这从右到左解析正则表达式和字符串,节省了一些字符。
在argv上输入,按
reverse正常顺序输出:C (interpreting),
630622617615582576573572558554544538529504502500499 charsThis understands parentheses, and works on the regexp live (i.e. not parsed first)
compiling with -Wall complains about argc being char. Will crash on illegal patterns.
This parses regexp and string from right to left, saving a few chars.
input on argv, output in
reversenormal order:GolfScript - 254 个字符
有点简单,第一个循环将正则表达式转换为 NFA,第二个循环运行该 NFA。
Sun Aug 22 00:58:24 EST 2010
(271→266) 更改了变量名称以删除空格Sun Aug 22 01:07:11 EST 2010
(266→265) 将[]
设为变量Sun Aug 22 07:05:50 EST 2010
(265→259) 内联进行空状态转换Sun Aug 22 07:19:21 EST 2010
(259→256) 最终状态隐式2011 年 2 月 7 日星期一 19:24:19 EST 2011
(256→254) 使用"()""str"*
GolfScript - 254 chars
Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.
Sun Aug 22 00:58:24 EST 2010
(271→266) changed variable names to remove spacesSun Aug 22 01:07:11 EST 2010
(266→265) made[]
a variableSun Aug 22 07:05:50 EST 2010
(265→259) made null state transitions inlineSun Aug 22 07:19:21 EST 2010
(259→256) final state made implicitMon Feb 7 19:24:19 EST 2011
(256→254) using"()""str"*
C (parsing+matching)
727670 个字符这还没有完全达到最低限度,但我想尝试看看首先解析正则表达式是否会对实时解释它产生影响。确实如此,因为它的成本更高,尽管解析和匹配都更容易编写/理解。
它将正则表达式解析为一个结构体,其中每个结构体具有:
*
或?
--(pat)+
被解析为 <立即 code>(pat)(pat)* 这使得匹配变得更加容易C (parsing+matching)
727670 charsThis is not fully golfed to the bare minimum yet, but I wanted to try and see if parsing the regexp first would make a difference to interpreting it live. It does, because it costs more, although both parsing and matching are easier to write/understand.
it parses a regexp into a struct, where each struct has:
*
or?
--(pat)+
is parsed into(pat)(pat)*
right away which makes matching far easierHaskell: 610
612627Ungolfed:
备注:
将
|
修改为后缀形式,然后反转整个正则表达式。现在运算符采用前缀形式,使其易于解析。该程序像这样解析正则表达式。列表单子用于不确定性。用法:
Haskell: 610
612627Ungolfed:
Memo:
Modifies
|
to a suffix form and then reverses the entire regex. Now the operators are in prefix form, making it easy to parse. The program parses the regex like this. The list monad is used for nondeterminism.Usage:
Ruby 1.9:709 个字符
(通过添加下面的别名,这也适用于 Ruby 1.8,具有 45 个以上的字符)
使用
type testcase.txt | 进行测试。 ruby regex.rb
Ruby 中 Russ Cox 的 NFA 解析器 的实现。状态被表示为具有单个键的散列,该键保存下一个状态的数组。
未打高尔夫球:
Ruby 1.9: 709 chars
(This will also works in Ruby 1.8 with 45 more chars by adding the alias below)
Test with
type testcase.txt | ruby regex.rb
An implementation of Russ Cox's NFA parser in Ruby. A state is represented as a hash with a single key which holds an array of next states.
Ungolfed:
Perl,596 个字符
半解释:
S { block }
与sub { block }
相同,只是每次短 2 个字符。$,
为 nil(包含始终匹配的匹配器的 coderef,不消耗任何内容)c
是 n 元串联(采用任意数量的匹配器并返回一个成功的匹配器,如果他们都按顺序成功)。a
是 n 元交替(采用任意数量的匹配器,如果其中任何一个匹配器成功,则返回一个成功的匹配器)。A
是正则表达式构建器的助手 - 它采用交替连接的结构,并根据需要传递给C
和a
,返回一个匹配者。k
是星号(获取一个匹配器并返回一个按顺序匹配它 0 次或多次的匹配器。k
对于 Kleene,因为s()
被解析为 s/// 运算符:)@$r
是当前串联列表,@a
是当前交替列表,@p
是 paren-group 堆栈。a+
被视为aa*
,而a?
在中被视为内联
((a|)
bplus
或maybe
没有函数)。S{!length pop}
是一个在输入结束时成功的匹配器。它作为正则表达式匹配器的延续传递,这意味着正则表达式只有在可以匹配整个字符串时才会成功。对于大部分经过脱节和更多注释的代码,请参阅此要点。
将其运行为
perl regexer.pl 'a?b+|(a+b|b+a?)+' abb babab aaa aabba a b
。代码中没有强制换行符。Perl, 596 chars
Semi-explanation:
S { block }
is the same assub { block }
only it's 2 chars shorter every time.$,
is nil (a coderef containing a matcher that always matches, consuming nothing)c
is n-ary concatenation (take any number of matchers and return a matcher that succeeds if they all succeed in sequence).a
is n-ary alternation (take any number of matchers and return a matcher that succeeds if any of them succeed).A
is a helper for the regex builder — it takes a structure of alternations-of-concatenations and passes toC
anda
as necessary, returning a matcher.k
is star (take a matcher and return a matcher that matches it 0 or more times in sequence.k
for Kleene, becauses()
is parsed as thes///
operator :)do
block parses the regex a char at a time.@$r
is the current concatenation list,@a
is the current alternation list,@p
is the paren-group stack.a+
is treated asaa*
, anda?
is treated as(a|)
inline inb
(there are no functions forplus
ormaybe
).S{!length pop}
in the last line is a matcher that succeeds at end-of-input. It's passed as the continuation for the regex matcher, which means that the regex only succeeds if it can match the entire string.For mostly-degolfed and more-commented code, see this Gist.
Run it as
perl regexer.pl 'a?b+|(a+b|b+a?)+' abb babab aaa aabba a b
. No mandatory linebreaks in the code.JavaScript,658 个字符
Ungolfed,约 1500 个字符
它通过递归、砍掉正则表达式的前面和字符串的前面并调用自身来工作。例如,
test("hello", "hello") =>测试(“你好”,“你好”)=>测试(“llo”,“llo”)=>测试(“lo”,“lo”)=>测试(“o”,“o”)=> test("", "")
返回 true。注意:裸c
函数不支持隐式交替。换句话说,hello|hi
将不起作用;它必须用括号括起来:(hello|hi)
。JavaScript, 658 chars
Ungolfed, ~1500 chars
This works by recursing, by chopping off the front of regex and the front of the string, and calling itself. For example,
test("hello", "hello") => test("ello", "ello") => test("llo", "llo") => test("lo", "lo") => test("o", "o") => test("", "")
returns true. Note: the barec
function will not support implied alternation. In other words,hello|hi
will not work; it has to be wrapped in parentheses:(hello|hi)
.