Parsing Gmail-style advanced search syntax?

Posted 2024-07-27 20:12:39 · 689 words · 7 views · 0 comments

I want to parse a search string similar to that provided by Gmail using Perl. An example input would be "tag:thing by:{user1 user2} {-tag:a by:user3}". I want to put it into a tree structure, such as

{and => [
    "tag:thing",
    {or => [
       "by:user1",
       "by:user2",
    ]},
    {or => [
       {not => "tag:a"},
       "by:user3",
    ]},
]}

The general rules are:

Tokens separated by space default to the AND operator.
Tokens in braces are alternative options (OR). The braces can go before or after the field specifier. i.e. "by:{user1 user2}" and "{by:user1 by:user2}" are equivalent.
Tokens prefixed with a hyphen are excluded.

These elements can also be combined and nested: e.g. "{by:user5 -{tag:k by:user3}} etc".

I'm thinking of writing a context-free grammar to represent these rules, and then parsing it into the tree. Is this unnecessary? (Is this possible using simple regexps?)

What modules are recommended for parsing context-free grammars?

(Eventually this will be used to generate a database query with DBIx::Class.)

冷夜 2024-08-03 20:12:41

If your query isn't tree structured, then regexes will do the job for you.

For example:

my $search = "tag:thing by:{user1 user2} {-tag:a by:user3}";
# Split on whitespace that is not inside a {...} group.
my @tokens = split /(?![^{]*})\s+/, $search;
foreach (@tokens) {
    my $or = s/[{}]//g; # OR mode: true if the token had braces
    my ($default_field_specifier) = /(\w+):/;
}
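
To make the flat case concrete, here is a minimal sketch that turns the tokens from the same split into the and/or/not structure from the question. The names `term` and `parse_flat` are made up for illustration. It assumes at most one level of braces, and it uses the `//` operator, so it needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: build one leaf, turning a leading "-" into a "not" node.
sub term {
    my ($field, $word) = @_;
    return $word =~ s/^-// ? { not => "$field$word" } : "$field$word";
}

# Hypothetical helper: parse a query with at most one level of braces.
sub parse_flat {
    my ($search) = @_;
    my @and;
    # Split on whitespace that is not inside a {...} group.
    for my $token (split /(?![^{]*\})\s+/, $search) {
        if ($token =~ /^(\w+:)?\{(.*)\}$/) {
            # A field given before the brace applies to every alternative.
            my ($field, $body) = ($1 // '', $2);
            my @alternatives = map { term($field, $_) } split ' ', $body;
            push @and, { or => \@alternatives };
        }
        else {
            push @and, term('', $token);
        }
    }
    return { and => \@and };
}

my $tree = parse_flat("tag:thing by:{user1 user2} {-tag:a by:user3}");
```

For the example input this should give the tree shown in the question. Nested braces such as "{by:user5 -{tag:k by:user3}}" are outside what this sketch handles.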

Even if your query is tree structured, regexes can make recursive parsing much more pleasant:

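$_ = "by:{user1 z:{user2 3} } x {-tag:a by:user3} zz";
pos($_) = 0;
scan_query("");

sub scan_query {
    my $default_specifier = shift;
    # $1 is a token (or the field prefix before a brace); $2 is set when "{" follows.
    # Literal braces are escaped so the pattern is accepted by newer Perl versions too.
    while (/\G\s*((?:[-\w:]+)|(?=\{))(\{)?/gc) {
        scan_query($1), next if $2;   # descend into the brace group
        my $query_token = $default_specifier . $1;
    }
    /\G\s*\}/gc;   # consume the closing brace, if there is one
}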
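
The scanner above computes each token but does not store it anywhere. As a rough sketch of how the same `\G` technique could build the tree the question asks for, something like the following might work. The names `parse_query` and `parse_group` are made up for illustration. It assumes a field prefix before a brace applies only to alternatives that have no field of their own, and it does not diagnose unbalanced braces. It uses the `//` operator, so it needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical entry point: returns a nested and/or/not structure.
sub parse_query {
    my ($input) = @_;
    pos($input) = 0;
    return parse_group(\$input, '', 'and');
}

# Parse items until a closing brace or the end of the input.
# $field is inherited from "field:{...}"; $op is 'and' or 'or'.
sub parse_group {
    my ($in, $field, $op) = @_;
    my @items;
    while (1) {
        $$in =~ /\G\s+/gc;                            # skip whitespace
        last if $$in =~ /\G\}/gc;                     # end of this brace group
        last if (pos($$in) // 0) >= length $$in;      # end of input
        my $negated = $$in =~ /\G-/gc;
        my $item;
        if ($$in =~ /\G((?:\w+:)?)\{/gc) {
            # Braces: alternatives, optionally prefixed by a field specifier.
            my $prefix = $1;
            $item = parse_group($in, $prefix ne '' ? $prefix : $field, 'or');
        }
        elsif ($$in =~ /\G([^\s{}]+)/gc) {
            my $word = $1;
            # Keep an explicit field; otherwise apply the inherited one.
            $item = $word =~ /:/ ? $word : "$field$word";
        }
        else {
            die "Unexpected input at offset " . pos($$in);
        }
        push @items, $negated ? { not => $item } : $item;
    }
    return { $op => \@items };
}

my $tree = parse_query("{by:user5 -{tag:k by:user3}} etc");
```

Because every group returns a plain hash reference, walking the result later to build a DBIx::Class search condition would be a separate recursive pass over the same shape.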

Regexes are awesome :)!
