Regular expression for finding valid sphinx fields

Posted 2024-08-29 23:14:58 · Word count 530 · Views 8 · Comments 0

I'm trying to validate that the fields given to sphinx are valid, but I'm having difficulty.

Imagine that valid fields are cat, mouse, dog, puppy.

Valid searches would then be:

  • @cat search terms
  • @(cat) search terms
  • @(cat, dog) search term
  • @cat searchterm1 @dog searchterm2
  • @(cat, dog) searchterm1 @mouse searchterm2

So, I want to use a regular expression to find terms such as cat, dog, mouse in the above examples, and check them against a list of valid terms.

Thus, a query such as:
@(goat)

Would produce an error because goat is not a valid term.

I've gotten as far as finding simple queries such as @cat with this regex: (?:@)([^( ]*)

But I can't figure out how to find the rest.

I'm using Python and Django, for what that's worth.

Comments (6)

十级心震 2024-09-05 23:14:58

To match all allowed fields, the following rather fearsome-looking regex works:

@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))

It returns these matches, in order: @cat, @(cat), @(cat, dog), @cat, @dog, @(cat, dog), @mouse.

The regex breaks down as follows:

@                               # the literal character "@"
(                               # match group 1
  (?:cat|mouse|dog|puppy)       #  one of your valid search terms (not captured)
  \b                            #  a word boundary
  |                             #  or...
  \(                            #  a literal opening paren
  (?:                           #  non-capturing group
    (?:cat|mouse|dog|puppy)     #   one of your valid search terms (not captured)
    (?:                         #   non-capturing group
      , *                       #    a comma "," plus any number of spaces
      |                         #    or...
      (?=\))                    #    a position followed by a closing paren
    )                           #   end non-capture group
  )+                            #  end non-capture group, repeat
  \)                            #  a literal closing paren
)                               # end match group one.

Now to identify any invalid search, you would wrap all that in a negative look-ahead:

@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^

This identifies any @ character that is followed by an invalid search term (or term combination). Note that it matches only the @ itself; modifying it so that it also matches the invalid attempt, instead of just pointing at it, is not hard from here.

You would have to build (?:cat|mouse|dog|puppy) dynamically from your list of fields and plug it into the static rest of the regex. That should not be too hard to do either.
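
As a rough sketch of that last step, the alternation can be assembled from a list of field names and dropped into the negative look-ahead pattern above. The function name `build_invalid_field_re` and the sample query are made up for illustration, and `re.escape` is added on the assumption that field names might contain regex metacharacters.

```python
import re

def build_invalid_field_re(valid_fields):
    """Compile the negative look-ahead pattern for a given list of field names."""
    # Escape each name so that any regex metacharacters are treated literally.
    alternation = "(?:" + "|".join(re.escape(name) for name in valid_fields) + ")"
    # Same static skeleton as the pattern above, with the alternation plugged in twice.
    return re.compile(
        r"@(?!" + alternation + r"\b|\((?:" + alternation + r"(?:, *|(?=\))))+\))"
    )

invalid_re = build_invalid_field_re(["cat", "mouse", "dog", "puppy"])

query = "@(cat, dog) searchterm1 @(goat) searchterm2 @caterpillar"
# Each match is just the "@" that starts an invalid field specification.
positions = [m.start() for m in invalid_re.finditer(query)]
for pos in positions:
    print("invalid field specification at offset", pos, ":", query[pos:])
```

Because the pattern matches only the @ sign, the offsets are what you get back; slicing the query from that offset, as in the loop, is one simple way to show the offending text.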

迷荒 2024-09-05 23:14:58

This pyparsing solution follows a similar logic path as your posted answer. All tags are matched, and then checked against the list of known valid tags, removing them from the reported results. Only those matches that have values left over after removing the valid ones are reported as matches.

from pyparsing import *

# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )

# define tags we consider to be valid
valid = set("cat mouse dog".split())

# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not(tokens):
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)


##### Test out the parser #####

test = """@cat search terms @ house
    @(cat) search terms 
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2 
    @caterpillar"""

# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print()

With these lovely results:

Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^

Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^

Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^

This last snippet will do all the scanning for you, and just give you the list of found invalid tags:

# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))

Prints:

['caterpillar', 'goat', 'doggerel']
毁梦 2024-09-05 23:14:58

This should work:

@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|@(cat|dog|mouse|puppy)\b

It will either match a single @parameter or a parenthesized @(par1, par2) list containing only allowed words (one or more).

It also makes sure that no partial matches are accepted (@caterpillar).
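
A small sketch of how this pattern might be used from Python. The helper names `valid_field_specs` and `has_invalid_field` are hypothetical, and comparing the number of matches with the number of @ signs is just one simple way to spot a rejected field; it assumes @ never appears in the query for any other reason.

```python
import re

FIELD_RE = re.compile(
    r"@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)"
    r"|@(cat|dog|mouse|puppy)\b"
)

def valid_field_specs(query):
    """Return every @field or @(field, field) specification the pattern accepts."""
    return [m.group(0) for m in FIELD_RE.finditer(query)]

def has_invalid_field(query):
    """True if some @ in the query does not start an accepted specification."""
    return query.count("@") != len(valid_field_specs(query))

print(valid_field_specs("@(cat, dog) a @mouse b @caterpillar @(goat)"))
print(has_invalid_field("@cat searchterm1 @dog searchterm2"))
```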

一江春梦 2024-09-05 23:14:58

Try this:

field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

A single field name (like cat in @cat) is captured in group #1, while the names in a parenthesized list like @(cat, dog) are captured together, as one string, in group #2. In the latter case you'll need to break the list down with split() or something similar; Python's re module keeps only the last repetition of a repeated group, so it can't capture the names individually.
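
The split() step could look roughly like this. It is a sketch that assumes the pattern has a second capturing group around the contents of the parentheses; `extract_fields`, `invalid_fields` and `VALID_FIELDS` are illustrative names, not part of the answer.

```python
import re

# Group 1: a bare field name; group 2: the contents of a parenthesized list.
FIELD_RE = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

VALID_FIELDS = {"cat", "mouse", "dog", "puppy"}  # assumed list of valid fields

def extract_fields(query):
    """Collect every field name mentioned after an @ sign, in order."""
    names = []
    for single, grouped in FIELD_RE.findall(query):
        if single:
            names.append(single)
        else:
            # Break "cat, dog" into individual names and trim the spaces.
            names.extend(part.strip() for part in grouped.split(","))
    return names

def invalid_fields(query):
    """Return the field names that are not in VALID_FIELDS."""
    return [name for name in extract_fields(query) if name not in VALID_FIELDS]

print(extract_fields("@(cat, dog) searchterm1 @mouse searchterm2"))
print(invalid_fields("@(cat, goat) searchterm @caterpillar"))
```

Keeping the regex loose and doing the validation with a set lookup means the pattern does not need to be rebuilt when the list of fields changes.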

无可置疑 2024-09-05 23:14:58

This matches a query whose field is cat, dog, mouse, or puppy, or a combination of them; anything else, such as the @goat query below, does not match.

import re
sphinx_term = "@goat some words to search"
# raw string, so the backslashes reach the regex engine unchanged
regex = re.compile(r"@\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    pass  # send the query to sphinx...
肤浅与狂妄 2024-09-05 23:14:58

I ended up doing this a different way, since none of the above worked. First I found the fields like @cat, with this:

attributes = re.findall(r'(?:@)([^\( ]*)', query)

Next, I found the more complicated ones, with this:

regex0 = re.compile(r'''
    @               # at sign
    (?:             # start non-capturing group
        \w+             # word characters, one or more
        \b              # a word boundary (i.e. no more \w)
        |               # OR
        (               # capturing group
            \(              # left paren
            [^@(),]+        # anything but @ ( ) ,
            (?:                 # another non-capturing group
                , *             # a comma, then optional spaces
                [^@(),]+        # anything but @ ( ) ,
            )*              # zero or more repetitions of this non-capturing group
            \)              # a right paren
        )               # end of capturing group
    )           # end of non-capturing group
    ''', re.VERBOSE)

# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))

Next, I checked whether the attributes I found were valid, and added the invalid ones to a list (each one only once):

# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')
# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)

Thanks all for the help though. I'm very glad to have had it!
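
For comparison, the final validity check in this answer can also be written with a plain set lookup instead of a second regex. This is only a sketch of the same idea: `find_bad_attrs` is a made-up name, and the set of valid names is copied from the validRegex pattern in the answer.

```python
def find_bad_attrs(attributes, valid):
    """Return the attributes that are not in `valid`, each one only once, in order."""
    bad = []
    for attribute in attributes:
        if not attribute:
            # zero-length attribute: skip it, as the answer does
            continue
        if attribute.lower() not in valid and attribute not in bad:
            bad.append(attribute)
    return bad

valid_names = {"mice", "mouse", "cat", "dog"}
print(find_bad_attrs(["cat", "", "goat", "Dog", "goat"], valid_names))
```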
