Finding all occurrences of a regex pattern in a string

Posted on 2025-02-11 03:01:44

I have an overly complicated regex that, as far as I know, is correct:

route = r"""[\s+|\(][iI](\.)?[vV](\.)?(\W|\s|$)?
               |\s intravenously|\s intravenous
               |[\s|\(][pP](\.)?[oO](\.)?(\W|\s|$)
               |\s perorally|\s?(per)?oral(ly)?|\s intraduodenally
               |[\s|\(]i(\.)?p(\.)?(\W|\s|$)?  
               |\s intraperitoneal(ly)?
               |[\s|\(]i(\.)?c(\.)?v(\.)?(\W|\s|$)? 
               |\s intracerebroventricular(ly)?
               |[\s|\(][iI](\.)?[gG](\.)?(\W|\s|$)?
               |\s intragastric(ly)?
               |[\s|\(]s(\.)?c(\.)?(\W|\s|$)?
               |subcutaneous(ly)?(\s+injection)?
               |[\s|\(][iI](\.)?[mM](\.)?(\W|\s|$)? 
               |\sintramuscular
          """

With re.search I manage to get one of the many patterns when it occurs in a string:

import re

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

m = re.search(route, s, re.X)
m.group(0)   # ' IV '

I read somewhere else to use re.findall to find all the occurrences.
In my dreams, this would return:

['IV', 'IM']

Unfortunately, the result is instead:

[('',
  '',
  ' ',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  ''),
 ('',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '')]

提笔书几行 2025-02-18 03:01:44

For the excerpt you show:

\b
(?: i 
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?
  |
    subcutaneous (?:ly)? (?: \s+ injection )? \b
)
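
As a rough sketch of how the pattern above might be used: compiled with both re.X and re.I, findall returns the whole matches, because every group is non-capturing. The names ROUTE and find_routes and the inline comments are made up for this illustration; the sample sentence is the one from the question.

```python
import re

# The pattern from this answer, compiled in verbose, case-insensitive mode.
# ROUTE and find_routes are illustrative names, not part of the original answer.
ROUTE = re.compile(r"""
\b
(?: i
    (?: ntra
        (?: cerebroventricular (?:ly)?
          | duodenally
          | gastric (?:ly)?
          | muscular
          | peritoneal (?:ly)?
          | venous (?:ly)?
        ) \b
      | \.? (?: [gmpv] | c \.? v ) \b \.?   # abbreviations starting with i
    )
  |
    (?:per)? oral (?:ly)? \b
  |
    p \.? o \b \.?                          # abbreviation p.o.
  |
    subcutaneous (?:ly)? (?: \s+ injection )? \b
)
""", re.X | re.I)


def find_routes(text):
    """Return every administration route found in text, in order."""
    return ROUTE.findall(text)


s = 'Pharmacokinetics parameters evaluated after single IV or IM'
print(find_routes(s))   # ['IV', 'IM']
```

Note that, as written, the pattern has no branch for the s.c. abbreviation that appears in the question, so that form would need its own alternative.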

Advice:

Your pattern is very long. You already use the re.X option, which is a good thing; exploit it fully by formatting the pattern in a rigorous and readable way. Optionally, keep the alternatives in alphabetical order. It may sound silly, but what a time saver! It's also possible to add inline comments starting with #.
You have many character classes that contain the same letter in both cases: use the global re.I flag too and write your pattern in lower case.
I see you try to delimit substrings with things like \s or the ugly [\s|\(] (you don't need to escape a parenthesis in a character class, and | doesn't mean OR inside it) and (\W|\s|$)? (which is totally useless, since you make it optional). Forget that and use word boundaries \b (read about them to understand exactly in which cases they match).
Use re.findall instead of re.search, since you expect several matches in a single string.
Use non-capturing groups (?: ... ) instead of capturing groups ( ... ): when a pattern contains capture groups, re.findall returns only the content of the capture groups and not the whole match.
Factorize your pattern from the left (the pattern is tested from left to right, so factoring out a common prefix reduces the number of branches to test). With this in mind, the subpattern (?:per)? oral (?:ly)? \b | p \.? o \b \.? could be rewritten in this way: oral (?:ly)? \b | p (?: eroral (?:ly)? \b | \.? o \b \.? )
You can also factorize from the right when possible. It's not a great improvement, but it reduces the pattern size.
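
To make the point about capture groups concrete, here is a small sketch on a deliberately reduced pattern (only the i.v. and i.m. abbreviations, not the full pattern). The variable names are invented for the illustration. It also shows re.finditer, which is one way to get the whole match even when capture groups are present.

```python
import re

s = 'Pharmacokinetics parameters evaluated after single IV or IM'

# With capturing groups, findall returns one tuple of group contents per match.
# Groups that did not take part in the match show up as empty strings.
capturing = re.findall(r'\bi(\.)?[vm](\.)?\b', s, re.I)
print(capturing)        # [('', ''), ('', '')]

# With non-capturing groups, findall returns the whole matched text.
non_capturing = re.findall(r'\bi(?:\.)?[vm](?:\.)?\b', s, re.I)
print(non_capturing)    # ['IV', 'IM']

# finditer yields match objects, so group(0) is the whole match either way.
whole = [m.group(0) for m in re.finditer(r'\bi(\.)?[vm](\.)?\b', s, re.I)]
print(whole)            # ['IV', 'IM']
```

This is the same effect as in the question: the 29-element tuples are the contents of the pattern's 29 capture groups, almost all of them empty.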

羁拥 2025-02-18 03:01:44

Use (?:\.)? to match zero or one period inside the abbreviation. Notice that both ip and i.p. are found. The pattern does not exclude matches inside longer words, for example the ip in POip.

import re

print("find combinations of words with or without periods")

data = "single intravenously intravenous IV oral PO intraperitoneal intraperitoneally i.p. ip"

# All groups are non-capturing, so findall returns the whole matches.
matches = re.findall(r'[iI](?:\.)?[vV](?:\.)?|intravenous(?:ly)?|[pP](?:\.)?[oO](?:\.)?|peroral(?:ly)?|oral(?:ly)?|[iI](?:\.)?[pP](?:\.)?|intraperitoneal(?:ly)?', data)
print(matches)

output:

['intravenously', 'intravenous', 'IV', 'oral', 'PO', 'intraperitoneal', 'intraperitoneally', 'i.p.', 'ip']
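
The caveat about matches inside longer words can be illustrated with a short sketch. The sample string and the two reduced patterns below are made up for the illustration; the second pattern borrows the \b word-boundary idea from the other answer as one possible way to avoid the problem.

```python
import re

# Invented sample: "POip" and "tip" contain route abbreviations as substrings.
text = "POip tip ip i.p."

# Without word boundaries, abbreviations are found inside other words too.
loose = re.findall(r'p\.?o\.?|i\.?p\.?', text, re.I)
print(loose)     # ['PO', 'ip', 'ip', 'ip', 'i.p.']

# With \b on both sides, only standalone abbreviations are reported.
# The trailing \.? picks up an optional final period after the boundary.
bounded = re.findall(r'\b(?:p\.?o|i\.?p)\b\.?', text, re.I)
print(bounded)   # ['ip', 'i.p.']
```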
