Unable to convert a URL parser regular expression to Ragel
I found a URL-parsing regular expression in RFC 2396 and RFC 3986.
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
I converted it to Ragel:
%%{
    # RFC 3986 URI Generic Syntax (January 2005)
    machine url_parser;

    action pchar {
        printf("%c", fc);
    }
    action scheme { printf("scheme\n"); }
    action scheme_end { printf("\nscheme_end\n"); }
    action authority { printf("authority\n"); }
    action authority_end { printf("\nauthority_end\n"); }
    action path { printf("path\n"); }
    action path_end { printf("\npath_end\n"); }
    action query { printf("query\n"); }
    action query_end { printf("\nquery_end\n"); }
    action fragment { printf("fragment\n"); }
    action fragment_end { printf("\nfragment_end\n"); }

    scheme = (any - [:/?#])+ >scheme $pchar %scheme_end ;
    authority = (any - [/?#])* >authority $pchar %authority_end ;
    path = (any - [?#])* >path $pchar %path_end ;
    query = (any - [#])* >query $pchar %query_end ;
    fragment = (any)* >fragment $pchar %fragment_end ;

    main := (( scheme ":" )?) <: (( "//" authority )?) <: path ( "?" query )? ( "#" fragment )?;
}%%

#include <cstdio>
#include <cstdlib>
#include <string>

/** Data **/
%% write data;

int main(int argc, char **argv) {
    std::string str(argv[1]);
    char const* p = str.c_str();
    char const* pe = p + str.size();
    char const* eof = pe;
    int cs = 0;
    %% write init;
    %% write exec;
    return p - str.c_str();
}
It works when I input an absolute URI:
liangxu@dev64:~$ ./uri_test "http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
scheme
http
scheme_end
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end
It also succeeds when I input an authority and a path:
liangxu@dev64:~$ ./uri_test "//www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end
But it fails (prints nothing) when I input only a path:
liangxu@dev64:~$ ./uri_test "/pub/ietf/uri"
What's wrong?

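You are using the wrong guard operator, <:. As soon as the optional authority section sees your first /, control is handed to that section. This becomes clear once you look at what <: is shorthand for: on every transition that matches the left-hand expression, the left side holds the HIGHER priority, so the right-hand expression is avoided. For the input "/pub/ietf/uri", the machine therefore commits to "//" at the first "/" and then has nowhere to go when it sees "p".
It is much easier if you convert the ABNF notation from the RFC to Ragel.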
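
While porting, it can help to have a reference for what the Ragel machine should report for each input. The sketch below runs the regular expression from the question directly through std::regex and returns the captured components. It is only a cross-check, not a replacement for the Ragel grammar; the names UriParts and split_uri are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Components of a URI reference, as captured by the regex in the question.
// UriParts is a hypothetical type used only for this illustration.
struct UriParts {
    std::string scheme, authority, path, query, fragment;
    bool has_authority = false;  // true when a "//" authority part was present
};

// Split a URI reference with the regular expression from the question.
// split_uri is a hypothetical helper name, not part of any library.
UriParts split_uri(const std::string& uri) {
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    UriParts parts;
    if (std::regex_match(uri, m, re)) {
        parts.scheme = m[2].str();          // empty if the group did not match
        parts.has_authority = m[3].matched; // group 3 is the whole "//authority"
        parts.authority = m[4].str();
        parts.path = m[5].str();
        parts.query = m[7].str();
        parts.fragment = m[9].str();
    }
    return parts;
}
```

For the failing input, split_uri("/pub/ietf/uri") yields an empty scheme, no authority, and the whole string as the path, which is the result the Ragel machine should reproduce once the guard is fixed.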