Unable to convert a URL parser regular expression to Ragel

Posted 12-25 12:50

I found a URL parser regular expression in RFC 2396 and RFC 3986.

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
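
For reference, the expression can be exercised directly with std::regex. The sketch below is not part of the original program: the UriParts struct and the split_uri and check_examples names are made up for illustration, and the group numbers follow RFC 3986, Appendix B. It suggests that the expression itself accepts a bare path such as /pub/ietf/uri, leaving the scheme and authority empty.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Components of a URI reference, numbered as in RFC 3986, Appendix B.
struct UriParts {
    std::string scheme;     // group 2
    std::string authority;  // group 4
    std::string path;       // group 5
    std::string query;      // group 7
    std::string fragment;   // group 9
};

// split_uri is a hypothetical helper name. It applies the regular
// expression quoted above and copies the numbered groups out.
// Groups that did not take part in the match come back as empty strings.
inline UriParts split_uri(const std::string& input) {
    static const std::regex re(
        R"re(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)re");
    UriParts parts;
    std::smatch m;
    if (std::regex_search(input, m, re)) {
        parts.scheme    = m[2].str();
        parts.authority = m[4].str();
        parts.path      = m[5].str();
        parts.query     = m[7].str();
        parts.fragment  = m[9].str();
    }
    return parts;
}

// Spot checks using inputs from the question.
inline void check_examples() {
    const UriParts abs = split_uri(
        "http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20");
    assert(abs.scheme == "http");
    assert(abs.authority == "www.ics.uci.edu");
    assert(abs.path == "/pub/ietf/uri/");
    assert(abs.query == "c=www&rot=1&e=%20%20");

    const UriParts rel = split_uri("/pub/ietf/uri");
    assert(rel.scheme.empty());
    assert(rel.authority.empty());
    assert(rel.path == "/pub/ietf/uri");
}
```

Calling check_examples() from a main function runs the spot checks; the path-only input is the one the Ragel version is expected to handle the same way.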

I converted it to Ragel:

%%{    
  # RFC 3986 URI Generic Syntax (January 2005)
  machine url_parser;

  action pchar     {
    printf("%c", fc);
  }
  action scheme            { printf("scheme\n"); }
  action scheme_end        { printf("\nscheme_end\n"); }
  action authority         { printf("authority\n"); }
  action authority_end     { printf("\nauthority_end\n"); }
  action path              { printf("path\n"); }
  action path_end          { printf("\npath_end\n"); }
  action query             { printf("query\n"); }
  action query_end         { printf("\nquery_end\n"); }
  action fragment          { printf("fragment\n"); }
  action fragment_end      { printf("\nfragment_end\n"); }

  scheme    = (any - [:/?#])+ >scheme    $pchar %scheme_end ;
  authority = (any - [/?#])*  >authority $pchar %authority_end ;
  path      = (any - [?#])*   >path      $pchar %path_end ;
  query     = (any - [#])*    >query     $pchar %query_end ;
  fragment  = (any)*          >fragment  $pchar %fragment_end ; 
  main     := (( scheme ":" )?) <: (( "//" authority )?) <: path ( "?" query )? ( "#" fragment )?;
}%%

#include <cstdio>
#include <cstdlib>
#include <string>

/** Data **/
%% write data;

int main(int argc, char **argv) {
  std::string str(argv[1]);
  char const* p = str.c_str();
  char const* pe = p + str.size();
  char const* eof = pe;
  int cs = 0;

  %% write init;
  %% write exec;

  return p - str.c_str();
}

It works when I input an absolute URI.

liangxu@dev64:~$ ./uri_test "http://www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
scheme
http
scheme_end
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end

It also succeeds when I input an authority and a path:

liangxu@dev64:~$ ./uri_test "//www.ics.uci.edu/pub/ietf/uri/?c=www&rot=1&e=%20%20"
authority
www.ics.uci.edu
authority_end
path
/pub/ietf/uri/
path_end
query
c=www&rot=1&e=%20%20
query_end

But it fails when I input only a path:

liangxu@dev64:~$ ./uri_test "/pub/ietf/uri"

What's wrong?

Comments (2)

谁的新欢旧爱 2025-01-01 12:50:48

You are using the wrong guard operator, <:. Once the authority section sees your first /, control is handed to the authority section.

This becomes clear if you look at what <: is an alias for:

expr $(unique_name,1) . expr >(unique_name,0)

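This means that every transition of the left expression is given the higher priority, so wherever both sides could consume the same character, the left expression wins and the right expression is avoided. For /pub/ietf/uri, the first / is taken as the start of "//", and the parse fails on the p that follows.

It is much easier if you convert the ABNF notation to Ragel.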
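
The decision the grammar has to express can be sketched in plain C++. In the RFC 3986 ABNF, an authority is only present when the hier-part starts with two slashes, so a single leading slash must fall through to the path. The following is a hand-written illustration under that assumption, not generated Ragel output and not the answerer's code; HierPart and split_hier_part are hypothetical names, and the input is assumed to be the text after any "scheme:" prefix and before any ? or #.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Result of splitting the hierarchical part of a URI reference.
struct HierPart {
    bool has_authority = false;
    std::string authority;
    std::string path;
};

// Hypothetical helper: commits to an authority only after seeing BOTH
// slashes of "//". A single leading '/' stays part of the path.
inline HierPart split_hier_part(const std::string& rest) {
    HierPart out;
    std::size_t path_start = 0;
    if (rest.compare(0, 2, "//") == 0) {
        out.has_authority = true;
        // The authority runs up to the next '/', or to the end of input.
        std::size_t end = rest.find('/', 2);
        if (end == std::string::npos) {
            end = rest.size();
        }
        out.authority = rest.substr(2, end - 2);
        path_start = end;
    }
    out.path = rest.substr(path_start);
    return out;
}
```

With this rule, split_hier_part("/pub/ietf/uri") reports no authority and returns the whole input as the path, which is the case the left-guarded version rejects.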

百思不得你姐 2025-01-01 12:50:48

I did the same thing myself recently; you can have a look at my Ragel grammar: https://github.com/maximecaron/ragel-url-parser
