Regular expression in Python: NOT matching

Posted on 2024-10-20 08:37:59

I'll get straight to it: I have a string like this (but with thousands of lines):

Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2

and I need to remove the lines that do not match a-z and ąčęėįšųūž plus _ plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be

[a-ząčęėįšųūž]+_\d+ # don't know where to put the case-insensitive modifier

But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried

re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)

but it didn't work.

Thanks in advance, and sorry if my English is not very good.

Comments (3)

节枝 2024-10-27 08:37:59

As for making the matching case insensitive, you can use the I or IGNORECASE flag from the re module, for example when compiling your regex:

regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)

As for removing the lines that do not match this regex, you can simply construct a new string consisting of the lines that do match:

new_s = "\n".join(line for line in s.split("\n") if regex.match(line))
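As a rough, self-contained sketch of the filtering approach described in this answer, the snippet below runs it on the sample lines from the question. The names LINE_RE and keep_matching_lines are made up for illustration, and it assumes Python 3, where str patterns are Unicode-aware so re.I also pairs "Ą" with "ą". It uses fullmatch instead of explicit ^ and $ anchors, which is a small variation and not what the answer itself wrote.

```python
import re

# Raw string so that \d reaches the regex engine untouched.
# re.I makes the character class match upper-case letters too (e.g. "Ą").
LINE_RE = re.compile(r"[a-ząčęėįšųūž]+_\d+", re.I)


def keep_matching_lines(text):
    """Return text reduced to the lines of the form letters + '_' + integer."""
    return "\n".join(
        line for line in text.splitlines() if LINE_RE.fullmatch(line)
    )


sample = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
result = keep_matching_lines(sample)
print(result)
```

With the sample above, only the lines Achėmos_18 and Ąžuolas_4 should survive, since the other three contain a hyphen, a dot or a space.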

帅哥哥的热头脑 2024-10-27 08:37:59

First of all, given your example inputs, every line ends with an underscore plus an integer, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could also land you lines like this:

abcdefg_nodigitshere

But you can sub-filter that this way:

import re
mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)

for line in inputs.splitlines():
    if myreg.match(line):
        pass  # do x: only letters, then _ and an integer
    elif mydigre.search(line):  # search, because match only looks at the start of the line
        pass  # do y: some other line that still ends with _\d+
    else:
        pass  # line doesn't end with _\d+

Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.

all_lines = set(inputs.splitlines())
alpha_lines = {line for line in all_lines if myreg.match(line)}
nonalpha_lines = all_lines - alpha_lines
nonalpha_digi_lines = {line for line in nonalpha_lines if mydigre.search(line)}
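The three-way split this answer describes (letters-only lines, other lines that still end in _ plus digits, and everything else) could be sketched roughly as follows. The function classify and the bucket names are hypothetical and chosen only for this illustration. Note that the "ends with _digits" check uses search, because match would only try the pattern at the start of each line.

```python
import re

ALPHA_RE = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)
DIGIT_TAIL_RE = re.compile(r"_\d+$")


def classify(text):
    """Sort lines into three buckets, preserving their original order."""
    buckets = {"alpha": [], "other_with_number": [], "no_number": []}
    for line in text.splitlines():
        if ALPHA_RE.match(line):
            buckets["alpha"].append(line)
        elif DIGIT_TAIL_RE.search(line):
            # Not letters-only, but still ends with '_' plus an integer.
            buckets["other_with_number"].append(line)
        else:
            buckets["no_number"].append(line)
    return buckets


sample = "Ach-emos_2\nAchėmos_18\nabcdefg_nodigitshere\nĄžuolas_4"
buckets = classify(sample)
print(buckets)
```

Unlike the set-based variant, this keeps duplicates and line order, at the cost of one pass with up to two regex checks per line.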

动听の歌 2024-10-27 08:37:59

Not sure how Python handles modifiers, but to edit in place, use something like this (case insensitive):

Edit: note that some of these characters are non-ASCII (UTF-8 encoded). To use them literally, your editor and language must support that; otherwise use \u.. escapes in the character class (recommended).

s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;

where the regex is: r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is ''
and the modifiers are multiline and global.

Breakdown (modifiers are global and multiline):

(?i)                              // case-insensitive flag
^                                 // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$))   // look ahead: the line is not of the wanted form
.*                                // then select everything up to the newline or end of string
(?:\n|$)                          // select the newline or the end of string
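For Python specifically, the substitution above might translate to something like the sketch below. It assumes Python 3 and passes the modifiers as flags (re.I and re.M) instead of the inline (?i). The "global" modifier has no counterpart because re.sub already replaces every match by default. The name NOT_WANTED_RE is made up for this example.

```python
import re

# Same pattern as in the answer, with the inline (?i) moved into flags.
# re.M lets ^ match at the start of every line, not just the whole string.
NOT_WANTED_RE = re.compile(
    r"^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)",
    re.I | re.M,
)

sample = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
result = NOT_WANTED_RE.sub("", sample)
print(repr(result))
```

Because each removed line takes its own trailing newline with it, the kept lines retain theirs; in this sample the result ends with a newline after Ąžuolas_4, which can be stripped afterwards if that matters.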