Using a regular expression to search for and capture characters in Python

Posted 2024-10-02 12:39:16


While going through one of the problems in Python Challenge, I am trying to solve it as follows:

Read the input in a text file with characters as follows:

DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQtvZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......

What I need is to go through this text file and pick out every lowercase letter that is enclosed by exactly three uppercase letters on each side.

The Python script that I wrote to do this is as follows:

import re

pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt', 'r')
for line in f:
    result = pattern.search(line)
    if result:
        print(result.groups())

f.close()

Instead of returning the captures (a list of lowercase characters), this script returns every text block that meets the regular expression criteria, like

aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........

Can somebody tell me what exactly I am doing wrong here? And instead of looping through the entire file, is there an alternate way to run the regular expression search on the entire file?

Thanks


负佳期 2024-10-09 12:39:16


Change result.groups() to result.group(1) and you will get just the single letter match.
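As a rough illustration of the difference, the sketch below runs the pattern against a made-up sample string (an assumption, not taken from the challenge data). groups() returns a tuple of every captured group, while group(1) returns only the first capture as a plain string.

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Hypothetical sample line, chosen only for illustration.
sample = "xxaXCSdFGHjxx"

result = pattern.search(sample)
whole_match = result.group(0)   # the entire matched block
all_groups = result.groups()    # tuple holding every captured group
first_group = result.group(1)   # just the captured lowercase letter

print(whole_match, all_groups, first_group)
```

Printing group(0) is what produces whole blocks such as aXCSdFGHj; the single letter lives in group(1).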

A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.
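A small sketch of how the three functions behave on one line that contains two matches. The sample line is invented for illustration and is not part of the challenge input.

```python
import re

pattern = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Hypothetical line containing two separate matches.
line = "aXCSdFGHj--vCDFeTYHa"

# search() stops at the first match on the line.
first_only = pattern.search(line).group(1)

# findall() returns the captured group of every match as a list of strings.
all_letters = pattern.findall(line)

# finditer() yields match objects, so positions are available as well.
with_positions = [(m.start(1), m.group(1)) for m in pattern.finditer(line)]

print(first_only, all_letters, with_positions)
```

finditer is the one to reach for when you need more than the captured text, for example the offset of each letter.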

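Here's how I approached the same problem:

import re
from urllib.request import urlopen

pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
# In Python 3, urlopen() returns bytes, so decode before matching (UTF-8 assumed).
html = urlopen("http://www.pythonchallenge.com/pc/def/equality.html").read().decode()
print(''.join(pat.findall(html)))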

Note that re.findall and re.finditer return non-overlapping results. So when you run re.findall with the above pattern against the string 'aBBBcDDDeFFFg', the only match is 'c'; 'e' is missed because its left-hand context was already consumed by the first match. Fortunately, this Python Challenge problem contains no such examples.
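If overlapping matches did matter, one common workaround (a different technique from the pattern in this answer) is to wrap the whole pattern in a zero-width lookahead, so that each match consumes no characters. A minimal sketch on the same example string:

```python
import re

text = "aBBBcDDDeFFFg"

# The plain pattern consumes 'aBBBcDDDe', so 'e' can no longer start a match.
plain = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

# Wrapped in (?=...), the pattern is only asserted, never consumed,
# so the scan can succeed again at a later, overlapping position.
overlapping = re.compile(r"(?=[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z])")

plain_hits = plain.findall(text)
overlap_hits = overlapping.findall(text)

print(plain_hits, overlap_hits)
```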


病毒体 2024-10-09 12:39:16

I'd suggest using lookaround:

(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])

This will have no problem with overlapping matches.

Explanation:

(?<=[A-Z]{3})  # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z])        # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3})   # assert that there are 3 uppercase letters after the current position
(?!.{3}[A-Z])  # assert that there is no uppercase letter 4 characters after the current position
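A quick sketch of how this lookaround pattern behaves, using invented test strings. One difference from the pattern in the question, worth keeping in mind: the negative assertions also succeed at the very start or end of the string, so a letter with no fourth neighbour at all still matches.

```python
import re

lookaround = re.compile(
    r"(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])"
)

# Hypothetical inputs, chosen to exercise each assertion.
overlapping_case = lookaround.findall("aBBBcDDDeFFFg")  # both letters qualify
four_uppercase = lookaround.findall("aBBBBcDDDe")       # 4 capitals on the left
at_string_edges = lookaround.findall("BBBcDDD")         # no outer characters

print(overlapping_case, four_uppercase, at_string_edges)
```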


独闯女儿国 2024-10-09 12:39:16

import re

with open('/Users/Dev/Sometext.txt', 'r') as f:
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())

for token in tokens:
    print(token)

What findall does:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
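The shape of the returned list depends on how many groups the pattern has. A short sketch with simplified patterns and a made-up string, for illustration only:

```python
import re

# Hypothetical sample text.
text = "aXCSdFGHj"

# No groups: a list of the full matches.
no_group = re.findall(r"[A-Z]{3}", text)

# One group: a list of that group's captures.
one_group = re.findall(r"[A-Z]{3}([a-z])", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([A-Z]{3})([a-z])", text)

print(no_group, one_group, two_groups)
```

Since the pattern in the question has exactly one group, findall hands back the lowercase letters directly.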

It may well be the most useful function in the re module.

The read() function reads the whole file into one big string. This is especially useful if you need to match a regular expression against the whole file.

Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.
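For a large file, a line-by-line variant might look like the sketch below. The helper name letters_from_lines is hypothetical, and io.StringIO stands in for an opened file so the example is self-contained. Because the pattern only matches letters, no match can span a newline, so this should find the same letters as reading the whole file at once.

```python
import io
import re

PATTERN = re.compile(r"[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")

def letters_from_lines(lines):
    """Collect the captured letters one line at a time (hypothetical helper)."""
    found = []
    for line in lines:
        # findall picks up every match on the line, not just the first.
        found.extend(PATTERN.findall(line))
    return found

# io.StringIO stands in for a real file object here; with a real file,
# pass the object returned by open() instead.
fake_file = io.StringIO("aXCSdFGHj\nvCDFeTYHa\nnHJUiKJHo\n")
letters = letters_from_lines(fake_file)

print("".join(letters))
```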
