Python - search for a string, copy until the end of the document

Posted on 2024-12-21 04:52:08

I am using Python to open EML files one at a time, process them, then move them to another folder. Each EML file contains an email message, including the headers.

The first 35-40 lines of the EML are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:

print(emllist[37:])

However, the beginning of the last line of the headers is always the same and begins with X-OriginalArrivalTime.

My goal is to parse my EML file, find the line that X-OriginalArrivalTime is on, and then split the EML into two strings, one containing the header info and one containing the message.

I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.

Any help is greatly appreciated

thanks

lou

Comments (5)

掐死时间 2024-12-28 04:52:08

You can probably avoid regex. How about:

msg = data.split('X-OriginalArrivalTime', 1)[1].split('\n', 1)[1]
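If the header text is needed as well as the message, one possible variation is `str.partition`, which also keeps the text before the marker. This is only a sketch: it assumes `data` holds the whole file as one string with `\n` line endings, and the helper name `split_eml` and the sample text are made up for illustration. Note that it matches the first occurrence of the marker anywhere in the text, not only at the start of a line.

```python
def split_eml(data, marker="X-OriginalArrivalTime"):
    """Split raw EML text into (header, message) at the marker line."""
    before, found, after = data.partition(marker)
    if not found:
        raise ValueError("End of header not found")
    # The marker line runs up to the next newline; the rest is the message.
    rest_of_line, newline, message = after.partition("\n")
    header = before + found + rest_of_line + newline
    return header, message


# Hypothetical sample standing in for the contents of an EML file.
sample = (
    "From: a@example.com\n"
    "X-OriginalArrivalTime: 01 Jan 2011 10:00:00\n"
    "\n"
    "Hello world\n"
)
header, message = split_eml(sample)
```

Because nothing is discarded, `header + message` reproduces the original text.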
瑾夏年华 2024-12-28 04:52:08

The re module is not very good at counting lines. What's more, you probably don't need it to check for the contents of the start of a line. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header, and the message.

def process_eml(filename):
    with open(filename) as fp:
        lines = fp.readlines()

    for i, line in enumerate(lines):
        if line.startswith("X-OriginalArrivalTime"):
            break
    else:
        raise Exception("End of header not found")

    # readlines() keeps each line's trailing newline, so join with ''.
    header = ''.join(lines[:i+1])  # Message starts at i + 1
    message = ''.join(lines[i+1:])

    return header, message
深白境迁sunset 2024-12-28 04:52:08

After

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)', open('foo.eml').read(), re.DOTALL | re.MULTILINE)

match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines, and re.MULTILINE lets ^ match at the start of every line.
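To try the regex approach without a file on disk, here is a small self-contained sketch. The sample text is invented, and a real script would read the EML contents from a file instead. Two details are worth noting: the greedy `.*` in the first group means the split happens at the last line that starts with the marker, and `\n+` puts any blank lines after the marker line into the header group.

```python
import re

# Hypothetical sample standing in for the contents of an EML file.
sample = (
    "From: a@example.com\n"
    "X-OriginalArrivalTime: 01 Jan 2011 10:00:00\n"
    "\n"
    "Hello world\n"
)

pattern = re.compile(
    r"(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)",
    re.DOTALL | re.MULTILINE,
)

match = pattern.search(sample)
if match is None:
    raise ValueError("End of header not found")

headers = match.group(1)  # everything up to and including the marker line
body = match.group(2)     # the rest of the text
```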

自找没趣 2024-12-28 04:52:08

I am not sure if it works with EML files, but Python has a module for working with email messages: the standard library's email package.

If that does not work, aren't the headers separated from the message by an empty line?

lines = fp.readlines()  # fp is the already-open EML file
header_end = lines.index('\n')  # first empty line, which should mark the end of the headers
headers = lines[:header_end]
message = lines[header_end + 1:]  # skip the blank separator line
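A minimal sketch of the standard-library route, assuming the EML content is plain RFC 822 style text that the `email` package can parse. The sample message below is invented for illustration; for a file on disk, `email.message_from_file` could be used with an open file object instead.

```python
import email

# Hypothetical sample standing in for the contents of an EML file.
raw = (
    "From: a@example.com\n"
    "Subject: test\n"
    "X-OriginalArrivalTime: 01 Jan 2011 10:00:00\n"
    "\n"
    "Hello world\n"
)

msg = email.message_from_string(raw)

headers = msg.items()     # list of (name, value) pairs, in original order
body = msg.get_payload()  # a string for a simple, non-multipart message
subject = msg["Subject"]  # individual headers can be looked up by name
```

For multipart messages, `get_payload()` returns a list of sub-messages rather than a string, so real mail may need `msg.is_multipart()` checks or `msg.walk()`.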
亣腦蒛氧 2024-12-28 04:52:08

It would indeed be nice to avoid a regex. But since you want to dispatch the header and the message into two separate strings, I think that split(), which discards the sequence it splits on, and partition(), which returns a tuple of 3 elements, do not fit the purpose, so a regex is still worth using:

import re

# Raw strings keep the regex escapes (\. and \Z) from being treated as string escapes.
regx = re.compile(r'(.+?X-OriginalArrivalTime\.[^\n]*[\r\n]+)'
                  r'(.+)\Z',
                  re.DOTALL)

ss = ('blahblah blah\r\n'
      'totoro tootrototo \r\n'
      'erteruuty\r\n'
      'X-OriginalArrivalTime. 12h58 Huntington Point\r\n'
      'body begins here\r\n'
      'sdkjhqsdlfkghqdlfghqdfg\r\n'
      '23135468796786876544\r\n'
      'ldkshfqskdjf end of file\r\n')


header, message = regx.match(ss).groups()

print('header :\n' + repr(header))
print()
print('message :\n' + repr(message))

Result:

header :
'blahblah blah\r\ntotoro tootrototo \r\nerteruuty\r\nX-OriginalArrivalTime. 12h58 Huntington Point\r\n'

message :
'body begins here\r\nsdkjhqsdlfkghqdlfghqdfg\r\n23135468796786876544\r\nldkshfqskdjf end of file\r\n'