Regex in Python: NOT matching
I'll get straight to it: I have a string like this (but with thousands of lines):
Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2
and I need to remove the lines that do not match a-z and ąčęėįšųūž, plus _, plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be
[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier
But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried
re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)
but it didn't work.
Thanks in advance, and sorry if my English is not very good.
Comments (3)
As for making the match case insensitive, you can use the I or IGNORECASE flag from the re module, for example when compiling your regex.
As for removing the lines that don't match this regex, you can simply construct a new string consisting of the lines that do match.
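A minimal sketch of both steps, assuming Python 3 and that the input is held in a str called words as in the question. The names pattern, kept and result are only illustrative, and fullmatch is used here so that the whole line has to fit the pattern:

```python
import re

# Sample input from the question (the real data has thousands of lines).
words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2\n"

# re.IGNORECASE (or its short alias re.I) is passed when compiling.
pattern = re.compile(r'[a-ząčęėįšųūž]+_\d+', re.IGNORECASE)

# Keep only the lines that match from start to end,
# then join them back into a new string.
kept = [line for line in words.splitlines() if pattern.fullmatch(line)]
result = "\n".join(kept)

print(result)
```

With the sample input, only the 3rd and 4th lines survive. If a trailing newline matters, append "\n" to result.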
First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:
But you can subfilter that this way:
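One possible reading of the inversion and the sub-filter, as a sketch only. The extra sample line "no suffix here" and the names wanted, suffix, inverted and subfiltered are assumptions added for illustration:

```python
import re

# Question's sample lines plus one hypothetical line without "_<integer>".
lines = [
    "Ach-emos_2",
    "Ach. emos_54",
    "Achėmos_18",
    "Ąžuolas_4",
    "Somtehing else_2",
    "no suffix here",
]

wanted = re.compile(r'[a-ząčęėįšųūž]+_\d+', re.IGNORECASE)
suffix = re.compile(r'_\d+\Z')  # underscore + integer at the very end

# Inverting the match: every line that is NOT letters + "_" + integer.
# This also picks up lines that have no "_<integer>" suffix at all.
inverted = [line for line in lines if not wanted.fullmatch(line)]

# Sub-filter: of those, keep only lines that do end in "_<integer>".
subfiltered = [line for line in inverted if suffix.search(line)]

print(inverted)
print(subfiltered)
```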
Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.
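A rough sketch of the set-based variant, under the same assumptions as the answer states (unique lines, order irrelevant). The names all_lines, matching and non_matching are illustrative:

```python
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2\n"
pattern = re.compile(r'[a-ząčęėįšųūž]+_\d+', re.IGNORECASE)

# Sets drop duplicates and do not preserve the original line order.
all_lines = set(words.splitlines())
matching = {line for line in all_lines if pattern.fullmatch(line)}

# Set difference gives the lines that did not match.
non_matching = all_lines - matching

print(sorted(non_matching))
```

Both sets live in memory at once, which is where the memory cost mentioned above comes from.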
I'm not sure how Python handles modifiers, but to edit in place, use something like this (case insensitive):
Edit: note that some of these characters are non-ASCII (UTF-8 encoded). To write them literally, your editor and language must support it; otherwise use \u.. escapes in the character class (recommended).
s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;
where the regex is:
r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)'
the replacement is '', and the modifiers are multiline and global.
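A sketch of how the Perl-style substitution above might translate to Python, assuming Python 3 and the words string from the question. In Python, re.sub replaces every match by default, so only the multiline behaviour needs a flag. The pattern is written with re.VERBOSE here purely to annotate its parts:

```python
import re

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2\n"

pattern = re.compile(r"""
    ^                          # start of a line (needs re.MULTILINE)
    (?!                        # negative lookahead: the line must NOT be
        [a-ząčęėįšųūž]+ _ \d+  #   letters, underscore, integer
        (?:\n|$)               #   followed directly by the line end
    )
    .*                         # the rest of the offending line
    (?:\n|$)                   # and its newline, if there is one
""", re.IGNORECASE | re.MULTILINE | re.VERBOSE)

# Replace every offending line with the empty string.
cleaned = pattern.sub("", words)

print(cleaned)
```

Here the flags are passed to re.compile instead of using the inline (?i) form; either spelling should behave the same for this pattern.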