当前位置：文江博客话题详情

Code Golf：正则表达式解析器

发布于 2024-09-14 18:09:11 字数 1926 浏览 10 评论 0原文

今天的 Code Golf 挑战的目标

是用尽可能少的字符创建一个正则表达式解析器。

语法

不，我不是要求您匹配 Perl 风格的正则表达式。毕竟，已经有一个非常可靠的翻译了！ :-)

以下是您需要了解的有关此挑战的正则表达式语法的所有信息：

术语定义为单个文字字符，或分组括号 () 内的正则表达式。
*（星号）字符表示对前一个 TERM 的Kleene 星形操作。这意味着零个或多个前一项连接在一起。
+（加号）字符表示一种方便的快捷方式：a+ 相当于 aa*，表示前一项的一个或多个。
?（问号）字符表示前一项中的零个或一个。
|（竖线）字符表示交替，这意味着任意一侧的正则表达式都可以在匹配中使用。
所有其他字符均假定为原义字符。您可以假设所有其他字符都在 [0-9A-Za-z] 范围内（即所有英文字母数字）。

或者，换句话说：*/+/? 具有最高优先级，然后是串联，然后是交替。由于交替的优先级低于串联，因此在不带括号的正则表达式中使用它会导致它绑定到每一侧的完整正则表达式。另一方面，* 和 + 以及 ? 仅适用于前一个术语。

挑战

您的挑战是编写一个程序来编译或解释正则表达式（如上面所定义），然后针对它测试多个字符串。

我将把意见留给你。我的建议是，正则表达式应该首先出现，然后是任意数量的字符串来对其进行测试；但如果你想让它持续下去，那也没关系。如果您想将所有内容放入命令行参数或 stdin 中，或者命令行中的正则表达式和 stdin 中的字符串，或者其他任何内容，都可以。仅显示一两个用法示例。

输出应为 true 或 false，每行一个，以反映正则表达式是否匹配。

注释：

我不需要这么说...但是不要在您的语言中使用任何正则表达式库！您需要自己编译或解释该模式。（编辑：如果需要拆分或连接字符串，您可以使用正则表达式。您只是不能使用它直接解决问题，例如，将输入正则表达式转换为语言正则表达式并使用它。 )
正则表达式必须完全匹配此挑战的输入字符串。（同样，如果您熟悉类似 Perl 的正则表达式，请假设所有匹配项都具有字符串开头和结尾锚定）
对于此挑战，所有特殊字符 ()*+?| 预计不会按字面意思出现。如果输入中出现一个，则可以安全地假设没有模式可以与相关字符串匹配。
要测试的输入字符串应以区分大小写的方式进行评估。

示例

对于示例，我假设一切都是在命令行参数中完成的，首先是正则表达式。（正如我上面所说，输入由您决定。）这里的 myregex 代表您对程序的调用。

> myregex easy easy Easy hard
true
false
false

> myregex ab*a aa abba abab b
true
true
false
false

> myregex 0*1|10 1 10 0110 00001
true
true
false
true

> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true

> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true

注意：抱歉，忘记创建社区维基！ :-(

原文

The goal

Today's Code Golf challenge is to create a regex parser in as few characters as possible.

The syntax

No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)

Here's all you need to know about regex syntax for this challenge:

A term is defined as a single literal character, or a regular expression within grouping parentheses ().
The * (asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together.
The + (plus) character represents a convenient shortcut: a+ is equivalent to aa*, meaning one or more of the previous term.
The ? (question mark) character represents zero or one of the previous term.
The | (pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match.
All other characters are assumed to be literal. You may assume that all other characters are within [0-9A-Za-z] (i.e., all English alphanumerics).

Or, put another way: */+/? have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. * and + and ?, on the other hand, would just apply to the immediately preceding term.

The challenge

Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.

I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.

Output should be true or false, one per line, to reflect whether or not the regex matches.

Notes:

I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. (Edit: You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
For this challenge, all of the special characters ()*+?| are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question.
Input strings to test should be evaluated in a case-sensitive manner.

The examples

For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex here represents your invocation of the program.

> myregex easy easy Easy hard
true
false
false

> myregex ab*a aa abba abab b
true
true
false
false

> myregex 0*1|10 1 10 0110 00001
true
true
false
true

> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true

> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true

NOTE: Sorry, forgot to make community wiki! :-(

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

迟到的我 2024-09-21 18:09:11

C（解释），630 622 617 615 582 576< /罢工> <罢工> 573 <罢工> 572 <罢工> 558 <罢工> 554 <罢工> 544 <罢工> 538 529 504 502 500 499 个字符

这可以理解括号，并且适用于正则表达式（即不首先解析））

#define C char
#define M m(s,o
m(C*s,C*o,C*S,C*p,C*P,C T){C*n=P-1,*q=s,h=*P==41,c=1;for(;h*c;c-=*n--==40)c+=*n==41;
c=*P-42;c=p-P?c-82?T&&c&~1&&c-21?h?2:*S==*P&s<S?M,S-1,p,n,2)||(T&4&&M,S-1,p,P,T|1)):
4:M,T?S:o,p,P-1,T?c&~1?3:7-c:0):T&&s==S||M,o,p,P-1,2):T&&s==S;if(c==2)for(c=4;q<=S;q
++)c|=m(q,S,S,n+h,P-h,2)?M,q,p,n,2)||q<S&T/4&&M,q,p,P,T&6):0;return
c&4?c&1?:T&1&&M,S,p,n,2)||M,o,p,n,0):c;}main(C*w,C**v){C*u;for(w=*++v;*++v;)puts(m(*v
-1,u,u=index(*v,0)-1,w-1,index(w,0)-1,2)?"true":"false");}

使用 -Wall 进行编译会抱怨 argc 为 char。遇到非法模式会崩溃。

这从右到左解析正则表达式和字符串，节省了一些字符。

在argv上输入，按~~reverse~~正常顺序输出：

$ ./a.out ab*a aa abba abab b
true
true
false
false

$ ./a.out "0*1|10" 1 10 0110 00001
true
true
false
true

$ ./a.out "0*(1|1+0)" 1 10 0110 00001
true
true
true
true

$ ./a.out "a?b+|(a+b|b+a?)+" abb babab aaa aabba a b
true
true
false
true
false
true

$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbb
false
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbbb
true

C (interpreting), 630 622 617 615 582 576 573 572 558 554 544 538 529 504 502 500 499 chars

This understands parentheses, and works on the regexp live (i.e. not parsed first)

#define C char
#define M m(s,o
m(C*s,C*o,C*S,C*p,C*P,C T){C*n=P-1,*q=s,h=*P==41,c=1;for(;h*c;c-=*n--==40)c+=*n==41;
c=*P-42;c=p-P?c-82?T&&c&~1&&c-21?h?2:*S==*P&s<S?M,S-1,p,n,2)||(T&4&&M,S-1,p,P,T|1)):
4:M,T?S:o,p,P-1,T?c&~1?3:7-c:0):T&&s==S||M,o,p,P-1,2):T&&s==S;if(c==2)for(c=4;q<=S;q
++)c|=m(q,S,S,n+h,P-h,2)?M,q,p,n,2)||q<S&T/4&&M,q,p,P,T&6):0;return
c&4?c&1?:T&1&&M,S,p,n,2)||M,o,p,n,0):c;}main(C*w,C**v){C*u;for(w=*++v;*++v;)puts(m(*v
-1,u,u=index(*v,0)-1,w-1,index(w,0)-1,2)?"true":"false");}

compiling with -Wall complains about argc being char. Will crash on illegal patterns.

This parses regexp and string from right to left, saving a few chars.

input on argv, output in ~~reverse~~ normal order:

$ ./a.out ab*a aa abba abab b
true
true
false
false

$ ./a.out "0*1|10" 1 10 0110 00001
true
true
false
true

$ ./a.out "0*(1|1+0)" 1 10 0110 00001
true
true
true
true

$ ./a.out "a?b+|(a+b|b+a?)+" abb babab aaa aabba a b
true
true
false
true
false
true

$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbb
false
$ ./a.out "((A|B|C)+(a|(bbbbb)|bb|c)+)+" ABCABCaccabbbbbaACBbbbb
true

回复收藏 0 原文

染柒℉ 2024-09-21 18:09:11

GolfScript - 254 个字符

n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%

有点简单，第一个循环将正则表达式转换为 NFA，第二个循环运行该 NFA。

Sun Aug 22 00:58:24 EST 2010 (271→266) 更改了变量名称以删除空格
Sun Aug 22 01:07:11 EST 2010 (266→265) 将 [] 设为变量
Sun Aug 22 07:05:50 EST 2010 (265→259) 内联进行空状态转换
Sun Aug 22 07:19:21 EST 2010 (259→256) 最终状态隐式
2011 年 2 月 7 日星期一 19:24:19 EST 2011 (256→254) 使用 "()""str"*

$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false

$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true

$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true

$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true

$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true

GolfScript - 254 chars

n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%

Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.

Sun Aug 22 00:58:24 EST 2010 (271→266) changed variable names to remove spaces
Sun Aug 22 01:07:11 EST 2010 (266→265) made [] a variable
Sun Aug 22 07:05:50 EST 2010 (265→259) made null state transitions inline
Sun Aug 22 07:19:21 EST 2010 (259→256) final state made implicit
Mon Feb 7 19:24:19 EST 2011 (256→254) using "()""str"*

$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false

$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true

$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true

$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true

$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true

回复收藏 0 原文

黯淡〆 2024-09-21 18:09:11

C (parsing+matching) 727 670 个字符

这还没有完全达到最低限度，但我想尝试看看首先解析正则表达式是否会对实时解释它产生影响。确实如此，因为它的成本更高，尽管解析和匹配都更容易编写/理解。

#define Q struct q
#define C char
#define R return
Q{Q*u,*n,*a;C c,m};Q*P(C*p,C*e){Q*r=calloc(99,1);C*n=p+1,c=1,w;if(p==e)R
r;if(*p==40){for(;c;)c+=(*n==40)-(*n++==41);r->u=P(p+1,n-1);}else
if(*p=='|'){r->a=P(p+1,e);R r;}else r->c=*p;if(n<e){if(*n==43)*n=42,r->n=P(p,e);else
w=*n==42|*n==63,r->n=P(n+w,e),r->m=w?*n:0;r->a=r->n->a;}R r;}M(Q*r,C*s,C*o,C*z){C*p,
e;e=r?r->m==63|r->m==42&&M(r->n,s,o,z)?:*s&&r->c==*s?M(r->m==42?r:r->n,s+1,o,z):2:s
==z;if(e-2)R e;for(p=s,e=0;!r->c&p<=z;p++)e|=M(r->u,s,s,p)&(r->m!=42|p>s)&&M(r->m==
42?r:r->n,p,p,z);R e||r->a&&M(r->a,o,o,z);}main(C
c,C**v){for(;--c>1;)puts(M(P(v[1],index(v[1],0)),v[c],v[c],index(v[c],0))?"true":"false");}

它将正则表达式解析为一个结构体，其中每个结构体具有：

要匹配的字符或指向要匹配的子模式结构体的指针
指向下一个符号结构体（如果有）的
指针一个指针对于下一个替代模式（如果有），
乘数为 \0、* 或 ? -- (pat)+ 被解析为 <立即 code>(pat)(pat)* 这使得匹配变得更加容易

C (parsing+matching) 727 670 chars

This is not fully golfed to the bare minimum yet, but I wanted to try and see if parsing the regexp first would make a difference to interpreting it live. It does, because it costs more, although both parsing and matching are easier to write/understand.

#define Q struct q
#define C char
#define R return
Q{Q*u,*n,*a;C c,m};Q*P(C*p,C*e){Q*r=calloc(99,1);C*n=p+1,c=1,w;if(p==e)R
r;if(*p==40){for(;c;)c+=(*n==40)-(*n++==41);r->u=P(p+1,n-1);}else
if(*p=='|'){r->a=P(p+1,e);R r;}else r->c=*p;if(n<e){if(*n==43)*n=42,r->n=P(p,e);else
w=*n==42|*n==63,r->n=P(n+w,e),r->m=w?*n:0;r->a=r->n->a;}R r;}M(Q*r,C*s,C*o,C*z){C*p,
e;e=r?r->m==63|r->m==42&&M(r->n,s,o,z)?:*s&&r->c==*s?M(r->m==42?r:r->n,s+1,o,z):2:s
==z;if(e-2)R e;for(p=s,e=0;!r->c&p<=z;p++)e|=M(r->u,s,s,p)&(r->m!=42|p>s)&&M(r->m==
42?r:r->n,p,p,z);R e||r->a&&M(r->a,o,o,z);}main(C
c,C**v){for(;--c>1;)puts(M(P(v[1],index(v[1],0)),v[c],v[c],index(v[c],0))?"true":"false");}

it parses a regexp into a struct, where each struct has:

the character to be matched or a pointer to a struct of a subpattern to be matched
a pointer to the next symbol struct (if any)
a pointer to the next alternative pattern (if any)
a multiplier which is \0, * or ? -- (pat)+ is parsed into (pat)(pat)* right away which makes matching far easier

回复收藏 0 原文

邮友 2024-09-21 18:09:11

Haskell: 610 612 627

main=getLine>>=f.words
d=reverse
u=0<1
j=[]
f(r:s)=mapM_(print.any null.c(d$b
 Ungolfed:
import Control.Monad
import Data.List

-- (aa|bb|cc)   -->  )|)cc()|)bb()aa((( 

testRegex = "a?b+|(a+b|b+a?)+"

interpret regex = any null . interpret' regex

interpret' regex = compile (rewrite regex)

mapFst f (x, y) = (f x, y)

splitWhileBalanced _ _ _ "" = ("", "")
splitWhileBalanced n b1 b2 (x:xs)
  | n < 1 && x == b1 = ("", x:xs)
  | x == b1   = f (-1)
  | x == b2   = f 1
  | otherwise = f 0
  where
    f k = mapFst (x:) $ splitWhileBalanced (n+k) b1 b2 xs

rewrite regex = reverse $ moveBars $ '(' : regex ++ ")"

moveBars regex
  | mtBar == "" = regex
  | otherwise = moveBars $ reverse (hOpen ++ "((" ++ tOpen) ++ ")(" ++ hClose ++ ")/)" ++ tClose
  where
    (hBar, mtBar) = splitWhileBalanced 0 '|' '!' regex -- '!' is a dummy character
    b:tBar = mtBar
    (hOpen, o:tOpen) = splitWhileBalanced 0 '(' ')' $ reverse hBar
    (hClose, c:tClose) = splitWhileBalanced 0 ')' '(' $ tBar

compile "" = \x -> [x]
compile rs = f <=< compile rs'
  where 
    (rs', f) = compile' rs

paren regex@(r:rs0)
  | r == '('  = (rs0, \x -> [x])
  | otherwise = (rs2, f <=< g)
  where
    (rs1, f) = compile' regex
    (rs2, g) = paren rs1

compile' (r:rs0) = case r of
  '/' -> (rs2, bar)
  '*' -> (rs1, star)
  '+' -> (rs1, plus)
  '?' -> (rs1, mark)
  ')' -> paren rs0
  _   -> (rs0, lit)
  where
    (rs1, f) = compile' rs0
    (rs2, g) = compile' rs1
    bar str = f str ++ g str
    plus str
      | null (f str) = []
      | otherwise = f str ++ (f str >>= plus)
    star str = str : plus str
    mark str = str : f str
    lit = maybe [] (\x -> [x]) . stripPrefix [r]

备注：
将 | 修改为后缀形式，然后反转整个正则表达式。现在运算符采用前缀形式，使其易于解析。该程序像这样解析正则表达式。列表单子用于不确定性。
用法：
> runghc Regex.hs
a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
True
True
False
True
False
True

(':r++")"))s
c%(x,y)=(c:x,y)
s _ _ _[]=(j,j)
s n a b (x:y)|n<1&&x==a=(j,x:y)|x==a=f(-1)|x==b=f 1|u=f 0where f k=x%s(n+k)a b y
b r|m==j=r|u=b$d(o++"(("++x)++")("++z++")/)"++w where(c,m)=s 0'|''!'r;_:v=m;(o,_:x)=s 0'('')'$d c;(z,_:w)=s 0')''('v
(!)g f x=f x>>=g
c[]=(:j)
c r=f!c s where(s,f)=i r
p q@(r:s)|r=='('=(s,(:j))|u=(a,f!g)where(w,f)=i q;(a,g)=p w
_?[]=j
c?(h:r)|c==h=[r]|u=j
i(r:q)=maybe(q,(r?))id$lookup r$zip")/*+?"$p q:zip[e,w,w,w][\s->f s++g s,\s->s:l s,l,\s->s:f s]where(w,f)=i q;(e,g)=i w;l s|f s==j=j|u=f s++(f s>>=l)

Ungolfed:

备注：

将 | 修改为后缀形式，然后反转整个正则表达式。现在运算符采用前缀形式，使其易于解析。该程序像这样解析正则表达式。列表单子用于不确定性。

用法：

Haskell: 610 612 627

main=getLine>>=f.words
d=reverse
u=0<1
j=[]
f(r:s)=mapM_(print.any null.c(d$b
Ungolfed:
import Control.Monad
import Data.List

-- (aa|bb|cc)   -->  )|)cc()|)bb()aa((( 

testRegex = "a?b+|(a+b|b+a?)+"

interpret regex = any null . interpret' regex

interpret' regex = compile (rewrite regex)

mapFst f (x, y) = (f x, y)

splitWhileBalanced _ _ _ "" = ("", "")
splitWhileBalanced n b1 b2 (x:xs)
  | n < 1 && x == b1 = ("", x:xs)
  | x == b1   = f (-1)
  | x == b2   = f 1
  | otherwise = f 0
  where
    f k = mapFst (x:) $ splitWhileBalanced (n+k) b1 b2 xs

rewrite regex = reverse $ moveBars $ '(' : regex ++ ")"

moveBars regex
  | mtBar == "" = regex
  | otherwise = moveBars $ reverse (hOpen ++ "((" ++ tOpen) ++ ")(" ++ hClose ++ ")/)" ++ tClose
  where
    (hBar, mtBar) = splitWhileBalanced 0 '|' '!' regex -- '!' is a dummy character
    b:tBar = mtBar
    (hOpen, o:tOpen) = splitWhileBalanced 0 '(' ')' $ reverse hBar
    (hClose, c:tClose) = splitWhileBalanced 0 ')' '(' $ tBar

compile "" = \x -> [x]
compile rs = f <=< compile rs'
  where 
    (rs', f) = compile' rs

paren regex@(r:rs0)
  | r == '('  = (rs0, \x -> [x])
  | otherwise = (rs2, f <=< g)
  where
    (rs1, f) = compile' regex
    (rs2, g) = paren rs1

compile' (r:rs0) = case r of
  '/' -> (rs2, bar)
  '*' -> (rs1, star)
  '+' -> (rs1, plus)
  '?' -> (rs1, mark)
  ')' -> paren rs0
  _   -> (rs0, lit)
  where
    (rs1, f) = compile' rs0
    (rs2, g) = compile' rs1
    bar str = f str ++ g str
    plus str
      | null (f str) = []
      | otherwise = f str ++ (f str >>= plus)
    star str = str : plus str
    mark str = str : f str
    lit = maybe [] (\x -> [x]) . stripPrefix [r]

Memo:
Modifies | to a suffix form and then reverses the entire regex. Now the operators are in prefix form, making it easy to parse. The program parses the regex like this. The list monad is used for nondeterminism.
Usage:
> runghc Regex.hs
a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
True
True
False
True
False
True

(':r++")"))s
c%(x,y)=(c:x,y)
s _ _ _[]=(j,j)
s n a b (x:y)|n<1&&x==a=(j,x:y)|x==a=f(-1)|x==b=f 1|u=f 0where f k=x%s(n+k)a b y
b r|m==j=r|u=b$d(o++"(("++x)++")("++z++")/)"++w where(c,m)=s 0'|''!'r;_:v=m;(o,_:x)=s 0'('')'$d c;(z,_:w)=s 0')''('v
(!)g f x=f x>>=g
c[]=(:j)
c r=f!c s where(s,f)=i r
p q@(r:s)|r=='('=(s,(:j))|u=(a,f!g)where(w,f)=i q;(a,g)=p w
_?[]=j
c?(h:r)|c==h=[r]|u=j
i(r:q)=maybe(q,(r?))id$lookup r$zip")/*+?"$p q:zip[e,w,w,w][\s->f s++g s,\s->s:l s,l,\s->s:f s]where(w,f)=i q;(e,g)=i w;l s|f s==j=j|u=f s++(f s>>=l)

Ungolfed:

Memo:

Modifies | to a suffix form and then reverses the entire regex. Now the operators are in prefix form, making it easy to parse. The program parses the regex like this. The list monad is used for nondeterminism.

Usage:

回复收藏 0 原文

温柔嚣张 2024-09-21 18:09:11

Ruby 1.9：709 个字符

R=gets.chop;s='';k=[];n=a=0
G={?(=>(A="(a-=1;s<<0)if a>1;")+"k<<[n,a];n=a=0",
Y=?|=>(B="s<<0while 0<a-=1;")+"n+=1",
?)=>B+(C="s<<?|while 0<=n-=1;")+"n,a=k.pop"+F=";a+=1",
?*=>D="s<<c",?+=>D,??=>D}
R.each_char{|c|eval G[c]||A+D+F};eval B+C
def P l,s;l.map{|a|a<<s};end
J={??=>(K="a=k.pop;")+"k<<[{Y=>n=[a[0]]},a[1]<<n]",
?*=>K+(H="P a[1],s={Y=>n=[a[0]]};")+"k<<[s,[n]]",
?+=>K+H+"k<<[a[0],[n]]",
Y=>(I=K+"b=k.pop;")+"k<<[{Y=>[a[0],b[0]]},a[1]+b[1]]",
?\0=>I+"P b[1],a[0];k<<[b[0],a[1]]"}
k=[];s.each_char{|c|eval J[c]||"k<<[{c=>a=[]},[a]]"}
e=k[0];P e[1],R;
p lt;.map{|l|s=l.chop;*a=e[0]
s.each_char{|c|eval@f="n=a;a=a.map{|h|h[Y]||[h]}.flatten"while a!=n
a=a.inject([]){|a,h|a+(h[c]||[])}}
eval@f;a.include? R}

（通过添加下面的别名，这也适用于 Ruby 1.8，具有 45 个以上的字符）

使用 type testcase.txt | 进行测试。 ruby regex.rb

Ruby 中 Russ Cox 的 NFA 解析器的实现。状态被表示为具有单个键的散列，该键保存下一个状态的数组。

未打高尔夫球：

#needed to run on ruby 1.8
class String
  alias :each_char :each_byte 
end  

## Infix to Postfix

R=gets.chop  
p "regexp = #{R}"
k=[] 
s=''  #will hold postfix string
n=a=0 #count braNches and Atoms
postfix = R.each_char{|c|
  case c
    when ?(
        (a-=1;s<<0)if a>1
        k<<[n,a]
        n=a=0
    when ?|
      s<<0while 0<a-=1
      n+=1
    when ?)
      s<<0while 0<a-=1
      s<<?|while 0<=n-=1
      n,a=k.pop;a+=1
    when ?*,?+,??
      s<< c
    else
      (a-=1;s<<0)if a>1
      s<< c
      a+=1
  end
}
s<<0while 0<a-=1
s<<?|while 0<=n-=1

## Postfix to NFA
# State is {char=>s0=[state,state]]
# Frag is [state, [s0,...]]

Y=?|   #choice indicator
X={:true=>[R]} #matcstate

 def patch l,s   #l is list of arrays, s is state.  Put s in each array
   l.map{|a|   a<< s   }
end

k=[]
s.each_char{|c|
 case c
    when ??
      a=k.pop
      s={Y=>n=[a[0]]}
      k<<[s,a[1]<<n]
    when ?*
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[s,[n]]
    when ?+
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[a[0],[n]]
    when ?|
     b=k.pop
     a=k.pop
      s={Y=>[a[0],b[0]]}
      k<<[s,a[1]+b[1]]
    when 0
     b=k.pop
     a=k.pop
     patch(a[1],b[0])
     k<<[a[0],b[1]]
   else
     k<<[{c=>a=[]},[a]]
   end
}
e=k.pop
patch(e[1],X)

#Running the NFA

E=[e[0]] #Evaluator
p lt;.map{|l|s=l.chop
  p "evaluating: #{s}" 
  a=E
  s.each_char{|c|
    begin     #skip past split nodes
      s=a.size
      a=a.map{|h|h[?|]||[h]}.flatten  
    end while a.size!=s
    a=a.inject([]){|a,h|
    a+(h[c]||[])}  #add next state or null
  }
  a=a.map{|h|h[?|]||[h]}.flatten
  r = a.include? X  #check for end of pattern
  p r
  r
}

Ruby 1.9: 709 chars

R=gets.chop;s='';k=[];n=a=0
G={?(=>(A="(a-=1;s<<0)if a>1;")+"k<<[n,a];n=a=0",
Y=?|=>(B="s<<0while 0<a-=1;")+"n+=1",
?)=>B+(C="s<<?|while 0<=n-=1;")+"n,a=k.pop"+F=";a+=1",
?*=>D="s<<c",?+=>D,??=>D}
R.each_char{|c|eval G[c]||A+D+F};eval B+C
def P l,s;l.map{|a|a<<s};end
J={??=>(K="a=k.pop;")+"k<<[{Y=>n=[a[0]]},a[1]<<n]",
?*=>K+(H="P a[1],s={Y=>n=[a[0]]};")+"k<<[s,[n]]",
?+=>K+H+"k<<[a[0],[n]]",
Y=>(I=K+"b=k.pop;")+"k<<[{Y=>[a[0],b[0]]},a[1]+b[1]]",
?\0=>I+"P b[1],a[0];k<<[b[0],a[1]]"}
k=[];s.each_char{|c|eval J[c]||"k<<[{c=>a=[]},[a]]"}
e=k[0];P e[1],R;
p lt;.map{|l|s=l.chop;*a=e[0]
s.each_char{|c|eval@f="n=a;a=a.map{|h|h[Y]||[h]}.flatten"while a!=n
a=a.inject([]){|a,h|a+(h[c]||[])}}
eval@f;a.include? R}

(This will also works in Ruby 1.8 with 45 more chars by adding the alias below)

Test with type testcase.txt | ruby regex.rb

An implementation of Russ Cox's NFA parser in Ruby. A state is represented as a hash with a single key which holds an array of next states.

Ungolfed:

#needed to run on ruby 1.8
class String
  alias :each_char :each_byte 
end  

## Infix to Postfix

R=gets.chop  
p "regexp = #{R}"
k=[] 
s=''  #will hold postfix string
n=a=0 #count braNches and Atoms
postfix = R.each_char{|c|
  case c
    when ?(
        (a-=1;s<<0)if a>1
        k<<[n,a]
        n=a=0
    when ?|
      s<<0while 0<a-=1
      n+=1
    when ?)
      s<<0while 0<a-=1
      s<<?|while 0<=n-=1
      n,a=k.pop;a+=1
    when ?*,?+,??
      s<< c
    else
      (a-=1;s<<0)if a>1
      s<< c
      a+=1
  end
}
s<<0while 0<a-=1
s<<?|while 0<=n-=1

## Postfix to NFA
# State is {char=>s0=[state,state]]
# Frag is [state, [s0,...]]

Y=?|   #choice indicator
X={:true=>[R]} #matcstate

 def patch l,s   #l is list of arrays, s is state.  Put s in each array
   l.map{|a|   a<< s   }
end

k=[]
s.each_char{|c|
 case c
    when ??
      a=k.pop
      s={Y=>n=[a[0]]}
      k<<[s,a[1]<<n]
    when ?*
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[s,[n]]
    when ?+
      a=k.pop
      s={Y=>n=[a[0]]}
      patch(a[1],s)
      k<<[a[0],[n]]
    when ?|
     b=k.pop
     a=k.pop
      s={Y=>[a[0],b[0]]}
      k<<[s,a[1]+b[1]]
    when 0
     b=k.pop
     a=k.pop
     patch(a[1],b[0])
     k<<[a[0],b[1]]
   else
     k<<[{c=>a=[]},[a]]
   end
}
e=k.pop
patch(e[1],X)

#Running the NFA

E=[e[0]] #Evaluator
p lt;.map{|l|s=l.chop
  p "evaluating: #{s}" 
  a=E
  s.each_char{|c|
    begin     #skip past split nodes
      s=a.size
      a=a.map{|h|h[?|]||[h]}.flatten  
    end while a.size!=s
    a=a.inject([]){|a,h|
    a+(h[c]||[])}  #add next state or null
  }
  a=a.map{|h|h[?|]||[h]}.flatten
  r = a.include? X  #check for end of pattern
  p r
  r
}

回复收藏 0 原文

メ斷腸人バ 2024-09-21 18:09:11

Perl，596 个字符

半解释：

这是基于高阶 Perl 第 8 章中发现的连续传递式解析器组合器。
Vincent Pit (VPIT) 帮助我删除了大约 70 个字符。
S { block } 与 sub { block } 相同，只是每次短 2 个字符。
$, 为 nil（包含始终匹配的匹配器的 coderef，不消耗任何内容）
c 是 n 元串联（采用任意数量的匹配器并返回一个成功的匹配器，如果他们都按顺序成功）。
a 是 n 元交替（采用任意数量的匹配器，如果其中任何一个匹配器成功，则返回一个成功的匹配器）。
A 是正则表达式构建器的助手 - 它采用交替连接的结构，并根据需要传递给 C 和 a，返回一个匹配者。
k 是星号（获取一个匹配器并返回一个按顺序匹配它 0 次或多次的匹配器。k 对于 Kleene，因为 s()被解析为 s/// 运算符:)
do 块一次解析正则表达式一个字符。 @$r 是当前串联列表，@a 是当前交替列表，@p 是 paren-group 堆栈。 a+ 被视为 aa*，而 a? 在 中被视为内联 (a|) b（plus 或 maybe 没有函数）。
最后一行中的 S{!length pop} 是一个在输入结束时成功的匹配器。它作为正则表达式匹配器的延续传递，这意味着正则表达式只有在可以匹配整个字符串时才会成功。

对于大部分经过脱节和更多注释的代码，请参阅此要点。

将其运行为 perl regexer.pl 'a?b+|(a+b|b+a?)+' abb babab aaa aabba a b。代码中没有强制换行符。

use feature':5.12';
sub S(&){pop}
$,=S{goto pop};
sub p{push@{+shift},@_}
sub c{my$l=$,;for my$r(@_){my$L=$l;
$l=S{my($i,$c)=@_;&$L($i,S{&$r(shift,$c)})}}$l}
sub a{my@A=@_;S{my($i,$c,$o)=@_;$o=&$_($i,$c)and return$o for@A;0}}
sub A{$#_?a(map c(@$_),@_):c(@{+pop})}
sub k{my($I,$k)=@_;$k=a c($I,S{&$k}),$,}
$_=shift;$P=do{@a=$r=[];for(/./g){
when('('){p\@p,[@a];@a=$r=[]}
when(')'){$p=A@a;@a=@{pop@p};p$r=$a[-1],$p}
p\@a,$r=[]when'|';
p$r,k pop@$r when'*';
p$r,c $r[-1],k pop@$r when'+';
p$r,a pop@$r,$,when '?';
my$s=$_;p$r,S{my($_,$c)=@_;s/^\Q$s//&&$_->$c}}A@a};
say&$P($_,S{!length pop})?"true":"false"for@ARGV

Perl, 596 chars

Semi-explanation:

This is based off of the continuation-passing-style parser combinators found in Chapter 8 of Higher Order Perl.
Credit for helping me remove about 70 chars goes to Vincent Pit (VPIT).
S { block } is the same as sub { block } only it's 2 chars shorter every time.
$, is nil (a coderef containing a matcher that always matches, consuming nothing)
c is n-ary concatenation (take any number of matchers and return a matcher that succeeds if they all succeed in sequence).
a is n-ary alternation (take any number of matchers and return a matcher that succeeds if any of them succeed).
A is a helper for the regex builder — it takes a structure of alternations-of-concatenations and passes to C and a as necessary, returning a matcher.
k is star (take a matcher and return a matcher that matches it 0 or more times in sequence. k for Kleene, because s() is parsed as the s/// operator :)
The do block parses the regex a char at a time. @$r is the current concatenation list, @a is the current alternation list, @p is the paren-group stack. a+ is treated as aa*, and a? is treated as (a|) inline in b (there are no functions for plus or maybe).
The S{!length pop} in the last line is a matcher that succeeds at end-of-input. It's passed as the continuation for the regex matcher, which means that the regex only succeeds if it can match the entire string.

For mostly-degolfed and more-commented code, see this Gist.

Run it as perl regexer.pl 'a?b+|(a+b|b+a?)+' abb babab aaa aabba a b. No mandatory linebreaks in the code.

use feature':5.12';
sub S(&){pop}
$,=S{goto pop};
sub p{push@{+shift},@_}
sub c{my$l=$,;for my$r(@_){my$L=$l;
$l=S{my($i,$c)=@_;&$L($i,S{&$r(shift,$c)})}}$l}
sub a{my@A=@_;S{my($i,$c,$o)=@_;$o=&$_($i,$c)and return$o for@A;0}}
sub A{$#_?a(map c(@$_),@_):c(@{+pop})}
sub k{my($I,$k)=@_;$k=a c($I,S{&$k}),$,}
$_=shift;$P=do{@a=$r=[];for(/./g){
when('('){p\@p,[@a];@a=$r=[]}
when(')'){$p=A@a;@a=@{pop@p};p$r=$a[-1],$p}
p\@a,$r=[]when'|';
p$r,k pop@$r when'*';
p$r,c $r[-1],k pop@$r when'+';
p$r,a pop@$r,$,when '?';
my$s=$_;p$r,S{my($_,$c)=@_;s/^\Q$s//&&$_->$c}}A@a};
say&$P($_,S{!length pop})?"true":"false"for@ARGV

回复收藏 0 原文

风苍溪 2024-09-21 18:09:11

JavaScript，658 个字符

// All whitespace is optional
function c(f,p){
    var x=f[0],w=p[0],h="substr",s=f[h](2),b=p[h](1),m=0,t=0,r,g,a=0,l,u="length",y="*";
    switch(f[1]){
        case"+":if(x!=w)return;f=x+y+s;
        case y:return x==w&&c(f,b)||c(s,p);
        case"?":return x==w&&c(s,b)||c(s,p)
    }
    if(x=="("){
        o:for(l=[""];t<f[u];t++){
            switch(f[t]){
                case"|":if(a==1){m=l.push("")-1;continue}break;
                case"(":if(++a==1)continue;break;
                case")":if(!--a)break o
            }
            l[m]+=f[t]
        }
        var v=0,e=t+1;
        return l.some(function(g){
            switch(f[t+1]){
                case y:v=1;
                case"+":e=t+2;
                    for(var q=0;q<f[u];q++)
                        if(c(g+Array(q).join(f[h](0,e))+f[h](e),p))
                            return 1;
                    return;
                case"?":v=1;e=t+2;default:if(c(g+f[h](e),p))return 1;
            }
        })||(v&&c(f[h](e),p))
    }
    return p[u]?(x==w&&c(f[h](1),b)):!f[u]
}

// Make it look nicer
function test(regex, string) { return !!c('(' + regex + ')', string); }

test('a?b+|(a+b|b+a?)+', 'abb') // true
test('a?b+|(a+b|b+a?)+', 'babab') // true

Ungolfed，约 1500 个字符

function test(reg, str) {
    console.log('Testing ' + reg + ' against ' + str);
    var c = reg[0], d = str[0];


    switch (reg[1]) {
        case '+': 
            if (c != d) 
                return false;   
            reg = c + '*' + reg.substr(2);
        case '*': 
            return (c == d && test(reg, str.substr(1))) || test(reg.substr(2), str);
        case '?': 
            return (c == d && test(reg.substr(2), str.substr(1))) || test(reg.substr(2), str);
    }

    if (c == '(') {
        var regs = [''];
        o: for (var level = n = i = 0; i < reg.length; i++) {
            //console.log(level + ': ' + n + ': ' + reg[i]);
            switch (reg[i]) {
                case '|': 
                    if (level == 1) { n = regs.push('') - 1; continue; } 
                    break;
                case '(': 
                    if (++level == 1) continue; 
                    break;
                case ')': 
                    if (!--level) break o; 
                    break;
            };
            regs[n] += reg[i];
        }
        //console.log(regs); // An array of alternates (hello|hi) => ['hello', 'hi']        

        var optional = false, end = i+1;
        return regs.some(function(jitem) {
            switch (reg[i+1]) {
                case '*': optional = true; // Fall through
                case '+': 
                    end = i+2;
                    for (var k = 0; k < reg.length; k++)
                        if (test(jitem + Array(k).join(reg.substr(0, i+2)) + reg.substr(i+2), str))
                            return true;
                    return false;

                case '?': optional = true; end = i+2; // Fall through
                default: if (test(jitem + reg.substr(end), str)) return true;       
            }

        }) || (optional && test(reg.substr(end), str));
    }

    if (str == '')
        return reg == '';

    return c == d ? test(reg.substr(1), str.substr(1)) : false;

}

它通过递归、砍掉正则表达式的前面和字符串的前面并调用自身来工作。例如， test("hello", "hello") =>测试（“你好”，“你好”）=>测试（“llo”，“llo”）=>测试（“lo”，“lo”）=>测试（“o”，“o”）=> test("", "") 返回 true。注意：裸 c 函数不支持隐式交替。换句话说，hello|hi 将不起作用；它必须用括号括起来：(hello|hi)。

JavaScript, 658 chars

// All whitespace is optional
function c(f,p){
    var x=f[0],w=p[0],h="substr",s=f[h](2),b=p[h](1),m=0,t=0,r,g,a=0,l,u="length",y="*";
    switch(f[1]){
        case"+":if(x!=w)return;f=x+y+s;
        case y:return x==w&&c(f,b)||c(s,p);
        case"?":return x==w&&c(s,b)||c(s,p)
    }
    if(x=="("){
        o:for(l=[""];t<f[u];t++){
            switch(f[t]){
                case"|":if(a==1){m=l.push("")-1;continue}break;
                case"(":if(++a==1)continue;break;
                case")":if(!--a)break o
            }
            l[m]+=f[t]
        }
        var v=0,e=t+1;
        return l.some(function(g){
            switch(f[t+1]){
                case y:v=1;
                case"+":e=t+2;
                    for(var q=0;q<f[u];q++)
                        if(c(g+Array(q).join(f[h](0,e))+f[h](e),p))
                            return 1;
                    return;
                case"?":v=1;e=t+2;default:if(c(g+f[h](e),p))return 1;
            }
        })||(v&&c(f[h](e),p))
    }
    return p[u]?(x==w&&c(f[h](1),b)):!f[u]
}

// Make it look nicer
function test(regex, string) { return !!c('(' + regex + ')', string); }

test('a?b+|(a+b|b+a?)+', 'abb') // true
test('a?b+|(a+b|b+a?)+', 'babab') // true

Ungolfed, ~1500 chars

function test(reg, str) {
    console.log('Testing ' + reg + ' against ' + str);
    var c = reg[0], d = str[0];


    switch (reg[1]) {
        case '+': 
            if (c != d) 
                return false;   
            reg = c + '*' + reg.substr(2);
        case '*': 
            return (c == d && test(reg, str.substr(1))) || test(reg.substr(2), str);
        case '?': 
            return (c == d && test(reg.substr(2), str.substr(1))) || test(reg.substr(2), str);
    }

    if (c == '(') {
        var regs = [''];
        o: for (var level = n = i = 0; i < reg.length; i++) {
            //console.log(level + ': ' + n + ': ' + reg[i]);
            switch (reg[i]) {
                case '|': 
                    if (level == 1) { n = regs.push('') - 1; continue; } 
                    break;
                case '(': 
                    if (++level == 1) continue; 
                    break;
                case ')': 
                    if (!--level) break o; 
                    break;
            };
            regs[n] += reg[i];
        }
        //console.log(regs); // An array of alternates (hello|hi) => ['hello', 'hi']        

        var optional = false, end = i+1;
        return regs.some(function(jitem) {
            switch (reg[i+1]) {
                case '*': optional = true; // Fall through
                case '+': 
                    end = i+2;
                    for (var k = 0; k < reg.length; k++)
                        if (test(jitem + Array(k).join(reg.substr(0, i+2)) + reg.substr(i+2), str))
                            return true;
                    return false;

                case '?': optional = true; end = i+2; // Fall through
                default: if (test(jitem + reg.substr(end), str)) return true;       
            }

        }) || (optional && test(reg.substr(end), str));
    }

    if (str == '')
        return reg == '';

    return c == d ? test(reg.substr(1), str.substr(1)) : false;

}

This works by recursing, by chopping off the front of regex and the front of the string, and calling itself. For example, test("hello", "hello") => test("ello", "ello") => test("llo", "llo") => test("lo", "lo") => test("o", "o") => test("", "") returns true. Note: the bare c function will not support implied alternation. In other words, hello|hi will not work; it has to be wrapped in parentheses: (hello|hi).

回复收藏 0 原文

~没有更多了~