Multiline pattern matching in Python

Posted 2024-08-30 09:03:33

A periodic computer-generated message (simplified):

Hello user123,

- (604)7080900
- 152
- minutes

Regards

Using Python, how can I extract "(604)7080900", "152", and "minutes" (i.e. any text following a leading "- " pattern) between the two empty lines? The empty lines are the \n\n after "Hello user123," and the \n\n before "Regards". Even better if the resulting strings are stored in a list (array). Thanks!

Edit: the number of lines between the two blank lines is not fixed.

2nd edit:

e.g.

hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world

x1, x2, and x3 are good, because the whole run of lines is enclosed by two empty lines; x4 is good for the same reason. x6 is not good because no blank line follows it, and x7 is not good because no blank line precedes it. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also good.

These conditions may not have been clear when I posted the question:

a continuous run of good lines between 2 empty lines

good line must have leading "- "
good line must follow an empty line or follow another good line
good line must be followed by an empty line or followed by another good line

thanks

Comments (4)

橘寄 2024-09-06 09:03:33
>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>
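
The pattern above hard-codes three lines and the literal word "minutes", while the question says the number of lines is not fixed. One possible way to generalize it, as a rough sketch only: match a whole run of "- " lines that has an empty line on both sides, then strip the prefix from each line. The names `BLOCK` and `good_blocks` are made up for this illustration, and the sketch assumes "\n" line endings and blank lines that are truly empty (no spaces).

```python
import re

# Hypothetical pattern: a run of "- ..." lines that is preceded by an empty
# line (two newlines in a row) and followed by another empty line.
BLOCK = re.compile(r"(?<=\n\n)(?:- .*\n)+(?=\n)")

def good_blocks(text):
    """Return one list per block, with the '- ' prefix removed from each line."""
    return [
        [line[2:].rstrip() for line in match.group().splitlines()]
        for match in BLOCK.finditer(text)
    ]

message = "Hello user123,\n\n- (604)7080900\n- 152\n- minutes\n\nRegards\n"
print(good_blocks(message))
# → [['(604)7080900', '152', 'minutes']]
```

Because both lookarounds require an empty line, a run that touches a non-matching line on either side (like x6 and x7 in the question's second edit) is skipped as a whole.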
表情可笑 2024-09-06 09:03:33

The simplest approach is to go over the lines (assuming you have a list of lines or a file, or you split the string into a list of lines) until you see an empty line, then check that each following line starts with '- ' (using the startswith string method), slice that prefix off, and store the result, until you find another empty line. For example:

# If you have a single string, split it into lines.
L = s.splitlines()
# If you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)

# Collected data lines, minus the '- ' prefix.
data = []
# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data.
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Slice off the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # Malformed data?
        raise ValueError("malformed line %r" % (line,))

Edit: Since you elaborated on what you want to do, here's an updated version of the loop. It no longer loops twice, but instead collects data until it encounters a 'bad' line, and either saves or discards the collected lines when it encounters a block separator. It doesn't need an explicit iterator, because it doesn't restart iteration, so you can just pass it a list (or any iterable) of lines:

def getblocks(L):
    # The list of good blocks (as lists of lines). You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad).
    block = []
    # Whether the current block is bad.
    bad = True
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = False
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data, skipping the empty block that two separators
            # in a row would produce. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below).
            if block:
                data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = True
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block).
    if not bad and block:
        data.append(block)
    return data

And here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]
比忠 2024-09-06 09:03:33
>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
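
Note that this one-liner picks up every line that starts with "- ", so it does not enforce the blank-line conditions from the question's second edit. A quick check against the sample from that edit shows the difference (the names `sample` and `matches` are just for this illustration):

```python
import re

# The sample from the question's second edit.
sample = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world"""

# With re.M, '^' matches at the start of every line, so any line beginning
# with "- " is returned, regardless of what surrounds it.
matches = re.findall(r"^- (.*)", sample, re.M)
print(matches)
# → ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']
```

So x6 and x7 are included here, although the question treats them as not good. It is fine for the original single-block message, but the stricter rules need an extra check for the surrounding empty lines.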
北笙凉宸 2024-09-06 09:03:33
s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

Do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

and you get this:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
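
One caveat with splitting on '-' is that a value containing a hyphen would be cut into pieces, and the approach relies on the word "Regards" being present. A possible alternative, as a sketch under stated assumptions: split the text on empty lines and keep only the chunks in which every line starts with "- ". The helper name `blocks_between_blank_lines` and the hyphenated phone number are made up for this illustration, and the sketch assumes "\n" line endings and blank lines that are truly empty.

```python
import re

def blocks_between_blank_lines(text):
    """Hypothetical helper: return the '- ' blocks enclosed by empty lines."""
    # One or more empty lines separate the chunks.
    chunks = re.split(r"\n{2,}", text)
    blocks = []
    # The first and last chunks lack an empty line on one side, so skip them.
    for chunk in chunks[1:-1]:
        lines = chunk.splitlines()
        # Keep the chunk only if every one of its lines is a "- " line.
        if lines and all(line.startswith("- ") for line in lines):
            blocks.append([line[2:].rstrip() for line in lines])
    return blocks

# Made-up message with a hyphenated value, to show it is left intact.
message = "Hello user123,\n\n- 604-708-0900\n- 152\n- minutes\n\nRegards\n"
print(blocks_between_blank_lines(message))
# → [['604-708-0900', '152', 'minutes']]
```

Only the first two characters of each line are removed, so hyphens inside a value survive, and a chunk that mixes "- " lines with other text is dropped as a whole.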