Find all occurrences of a regex pattern in a string

Posted on 2025-02-11 03:01:44

I have an overly complicated regex that, as far as I know, is correct:

route = r"""[\s+|\(][iI](\.)?[vV](\.)?(\W|\s|$)?
               |\s intravenously|\s intravenous
               |[\s|\(][pP](\.)?[oO](\.)?(\W|\s|$)
               |\s perorally|\s?(per)?oral(ly)?|\s intraduodenally
               |[\s|\(]i(\.)?p(\.)?(\W|\s|$)?  
               |\s intraperitoneal(ly)?
               |[\s|\(]i(\.)?c(\.)?v(\.)?(\W|\s|$)? 
               |\s intracerebroventricular(ly)?
               |[\s|\(][iI](\.)?[gG](\.)?(\W|\s|$)?
               |\s intragastric(ly)?
               |[\s|\(]s(\.)?c(\.)?(\W|\s|$)?
               |subcutaneous(ly)?(\s+injection)?
               |[\s|\(][iI](\.)?[mM](\.)?(\W|\s|$)? 
               |\sintramuscular
          """

With re.search I manage to get one of the numerous patterns when it occurs in a string:

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

m = re.search(re.compile(route, re.X), s)
m.group(0)
' IV '

I read somewhere else to use re.findall to find all the occurrences.
In my dreams, this would return:

['IV', 'IM']

Unfortunately instead the result is:

[('',
  '',
  ' ',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''),
 ('',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '')]


Comments (2)

提笔书几行 2025-02-18 03:01:44

For the excerpt you show:

\b
(?: i 
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s+ injection )? \b
)

Advice:

  • Your pattern is very long. You already use the re.X option, which is a good thing; exploit it fully by formatting the pattern in a rigorous and readable way. Consider putting the alternatives in alphabetical order. It may sound silly, but it is a real time saver. You can also add inline comments starting with #.
  • You have many character classes that contain the same letter in both cases: use the global re.I flag as well and write the pattern in lower case.
  • I see you try to delimit substrings with things like \s or the ugly [\s|\(] (you don't need to escape a parenthesis in a character class, and | doesn't mean OR inside one) and (\W|\s|$)? (which is useless, since you make it optional). Forget that and use word boundaries \b (read about them to understand exactly where they match).
  • Use re.findall instead of re.search, since you expect several matches in a single string.
  • Use non-capturing groups (?: ... ) instead of capturing groups ( ... ). When a pattern contains capture groups, re.findall returns only the content of the capture groups and not the whole match.
  • Factorize your pattern from the left (the pattern is tested from left to right, so factorizing from the left reduces the number of branches to test). With this in mind, the subpattern (?:per)? oral (?:ly)? \b | p \.? o \b \.? could be rewritten as: oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? )
  • You can also factorize from the right when possible. It's not a great improvement, but it reduces the pattern size.
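
Putting the points above together, here is a minimal Python sketch. It assumes the pattern shown earlier in this answer is compiled with re.I | re.X (the flags recommended above) and reuses the names route and s from the question. The first two lines only illustrate how re.findall behaves with and without a capture group; they are an illustration, not part of the original pattern.

```python
import re

# With a capture group, findall returns the group's content only
# (an empty string when the group did not take part in the match).
with_group = re.findall(r"i(\.)?v", "IV i.v", re.I)
# With a non-capturing group, findall returns the whole match.
without_group = re.findall(r"i(?:\.)?v", "IV i.v", re.I)
print(with_group)      # ['', '.']
print(without_group)   # ['IV', 'i.v']

# The pattern from this answer, compiled case-insensitively in verbose mode.
route = re.compile(r"""
    \b
    (?: i
        (?: ntra
            (?: cerebroventricular (?:ly)?
              | duodenally
              | gastric (?:ly)?
              | muscular
              | peritoneal (?:ly)?
              | venous (?:ly)?
            ) \b
          | \.? (?: [gmpv] | c \.? v ) \b \.?   # i.v., i.p., i.g., i.m., i.c.v.
        )
      |
        (?:per)? oral (?:ly)? \b
      |
        p \.? o \b \.?                          # p.o.
      |
        subcutaneous (?:ly)? (?: \s+ injection )? \b
    )
""", re.I | re.X)

s = 'Pharmacokinetics parameters evaluated after single IV or IM'
print(route.findall(s))   # ['IV', 'IM']
```

Because the pattern contains no capture groups, findall returns the matched substrings themselves, and the word boundaries keep the surrounding spaces out of the result.
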
羁拥 2025-02-18 03:01:44

Use (?:\.)? to match zero or one period inside the abbreviation. Notice that both ip and i.p. are found. Be aware that this pattern does not exclude matches inside longer strings; for example, it would still find ip inside POip.

import re

print("find combinations of words with or without periods")

data = "single intravenously intravenous IV oral PO intraperitoneal intraperitoneally i.p. ip"

# Non-capturing groups (?:...) make findall return the whole match
matches = re.findall(r'[iI](?:\.)?[vV](?:\.)?|intravenous(?:ly)?|[pP](?:\.)?[oO](?:\.)?|peroral(?:ly)?|oral(?:ly)?|[iI](?:\.)?[pP](?:\.)?|intraperitoneal(?:ly)?', data)
print(matches)

output:

['intravenously', 'intravenous', 'IV', 'oral', 'PO', 'intraperitoneal', 'intraperitoneally', 'i.p.', 'ip']
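
If matches inside longer strings such as POip are unwanted, one possible variation (an assumption on my part, not something tested beyond this sample string) is to anchor each alternative with word boundaries \b and use re.I instead of spelling out both cases. The name pattern and the extra token POip in the data are introduced here purely for illustration.

```python
import re

# Abbreviations and full words are both wrapped in \b so that they
# only match as standalone words; re.I covers upper and lower case.
pattern = re.compile(
    r"\b(?:i\.?v|p\.?o|i\.?p)\b\.?"
    r"|\b(?:intravenous|(?:per)?oral|intraperitoneal)(?:ly)?\b",
    re.I,
)

data = ("single intravenously intravenous IV oral PO "
        "intraperitoneal intraperitoneally i.p. ip POip")

matches = pattern.findall(data)
print(matches)
```

On this sample the list is the same as the output above, and the trailing POip contributes nothing because there is no word boundary between PO and ip.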