分析正则表达式词法分析器

发布于 2024-11-30 05:52:47 字数 3991 浏览 1 评论 0原文

我用 PHP 创建了一个路由器,它接受 DSL(基于 Rails 3 路由)并将其转换为正则表达式。它有可选的段(由(嵌套)括号表示)。以下是当前的词法分析算法:

private function tokenize($pattern)
{
    $rules = array(
        self::OPEN_PAREN_TYPE  => '/^(\()/',
        self::CLOSE_PAREN_TYPE => '/^(\))/',
        self::VARIABLE_TYPE    => '/^:([a-z0-9_]+)/',
        self::TEXT_TYPE        => '/^([^:()]+)/',
    );

    $cursor = 0;
    $tokens = array();
    $buffer = $pattern;
    $buflen = strlen($buffer);

    while ($cursor < $buflen)
    {
        $chunk = substr($buffer, $cursor);

        $matched = false;
        foreach ($rules as $type => $rule)
        {
            if (preg_match($rule, $chunk, $matches))
            {
                $tokens[] = array(
                    'type'  => $type,
                    'value' => $matches[1],
                );

                $matched = true;
                $cursor += strlen($matches[0]);
            }
        }

        if (!$matched)
        {
            throw new \Exception(sprintf('Problem parsing route "%s" at char "%d".', $pattern, $cursor));
        }
    }

    return $tokens;
}

是否有我遗漏的明显加速?有什么方法可以完全放弃 preg_* ,或者将正则表达式组合成一种模式等?

这是 xhprof 调用图(基准测试使用约 2500 条独特的路由进行测试):

callgraph

我知道最好的解决方案是不要为每个请求调用此方法(我计划使用 APC 进行缓存等),但希望使使用此库但未启用 APC 的人尽可能高效。

编辑:

我还编写了一个快速状态机版本,它似乎性能更好。我仍然愿意接受任何方面的建议,因为我相信第一个代码更优雅。

private function tokenize2($pattern)
{
    $buffer = '';
    $invariable = false;

    $tokens = array();
    foreach (str_split($pattern) as $char)
    {
        switch ($char)
        {
            case '(':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $tokens[] = array(
                    'type' => self::OPEN_PAREN_TYPE,
                );
                break;
            case ')':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $tokens[] = array(
                    'type' => self::CLOSE_PAREN_TYPE,
                );
                break;
            case ':':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $invariable = true;
                break;
            default:
                if ($invariable && !(ctype_alnum($char) || '_' == $char ))
                {
                    $invariable = false;
                    $tokens[] = array(
                        'type'  => self::VARIABLE_TYPE,
                        'value' => $buffer,
                    );

                    $buffer = '';
                    $invariable = false;
                }

                $buffer .= $char;
                break;
        }
    }

    if ($buffer)
    {
        $tokens[] = array(
        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
        'value' => $buffer,
        );
        $buffer = '';
    }

    return $tokens;

callgraph


出于性能原因,我最终只使用状态机,并使用 APC 缓存整个词法分析过程(因为......为什么不呢)。

如果有人有任何贡献,我会很乐意移动答案。

I've created a router in PHP which takes a DSL (based on the Rails 3 route) and converts it to Regex. It has optional segments (denoted by (nested) parenthesis). The following is the current lexing algorithm:

private function tokenize($pattern)
{
    $rules = array(
        self::OPEN_PAREN_TYPE  => '/^(\()/',
        self::CLOSE_PAREN_TYPE => '/^(\))/',
        self::VARIABLE_TYPE    => '/^:([a-z0-9_]+)/',
        self::TEXT_TYPE        => '/^([^:()]+)/',
    );

    $cursor = 0;
    $tokens = array();
    $buffer = $pattern;
    $buflen = strlen($buffer);

    while ($cursor < $buflen)
    {
        $chunk = substr($buffer, $cursor);

        $matched = false;
        foreach ($rules as $type => $rule)
        {
            if (preg_match($rule, $chunk, $matches))
            {
                $tokens[] = array(
                    'type'  => $type,
                    'value' => $matches[1],
                );

                $matched = true;
                $cursor += strlen($matches[0]);
            }
        }

        if (!$matched)
        {
            throw new \Exception(sprintf('Problem parsing route "%s" at char "%d".', $pattern, $cursor));
        }
    }

    return $tokens;
}

Are there any obvious speed-ups that I am missing? Any way to abandon preg_* altogether, or combine the regexes into one pattern, etc?

Here is the xhprof callgraph (benchmark uses ~2500 unique routes for testing):

callgraph

I know the best solution would be not to call this for every request (I plan on caching with APC, etc), but would like to make it as efficient as possible for people that use this library without APC enabled.

EDIT:

I also wrote a quick state machine version, which seems to perform better. I'm still open to suggestions on either front, as I believe the first code was more elegant.

private function tokenize2($pattern)
{
    $buffer = '';
    $invariable = false;

    $tokens = array();
    foreach (str_split($pattern) as $char)
    {
        switch ($char)
        {
            case '(':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $tokens[] = array(
                    'type' => self::OPEN_PAREN_TYPE,
                );
                break;
            case ')':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $tokens[] = array(
                    'type' => self::CLOSE_PAREN_TYPE,
                );
                break;
            case ':':
                if ($buffer)
                {
                    $tokens[] = array(
                        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
                        'value' => $buffer,
                    );
                    $buffer = '';
                    $invariable = false;
                }

                $invariable = true;
                break;
            default:
                if ($invariable && !(ctype_alnum($char) || '_' == $char ))
                {
                    $invariable = false;
                    $tokens[] = array(
                        'type'  => self::VARIABLE_TYPE,
                        'value' => $buffer,
                    );

                    $buffer = '';
                    $invariable = false;
                }

                $buffer .= $char;
                break;
        }
    }

    if ($buffer)
    {
        $tokens[] = array(
        'type'  => $invariable ? self::VARIABLE_TYPE : self::TEXT_TYPE,
        'value' => $buffer,
        );
        $buffer = '';
    }

    return $tokens;

callgraph


I ended up just using the state machine for performance reasons, and caching the entire lexing process with APC (because... why not).

If anyone has anything to contribute, I'll happily move the answer.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

木落 2024-12-07 05:52:47

有趣的代码:)。

我不太确定你所说的“用 APC 缓存整个词法分析过程”是什么,所以我可能会建议你已经在做什么。

您可以只缓存输入 URL 以及实际词法分析过程之上的结果吗?您不希望在此处应用任何基于权限的限制,因此缓存是全局的。即使是在只有少数热点的大型站点上,路线的数量往往也是有限的。完全绕过词法分析并命中任何先前使用的路由上的缓存。

Interesting code :).

I'm not quite sure what you're saying with "caching the entire lexing process with APC", so I may be suggesting what you're already doing.

Can you just cache the input URL, and the result above the actual lexing process? You don't look to be applying any permissions based restrictions here, so the cache is global. Routes tend to be limited in number, even on a large site, with a few very hot spots. Bypass the lexing entirely and hit the cache on any previously used route.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文