Python - Search for a string, copy until the end of the document

Posted on 2024-12-21 04:52:08

I am using Python to open EML files one at a time, process them, then move them to another folder. Each EML file contains an email message, including the headers.

The first 35-40 lines of the EML are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:

print(emllist[37:])

However, the last line of the headers always begins the same way: with X-OriginalArrivalTime.

My goal is to parse my EML file, find the line that X-OriginalArrivalTime is on, and then split the EML into 2 strings, one containing the header info and one containing the message.

I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.

Any help is greatly appreciated

thanks

lou

掐死时间 2024-12-28 04:52:08

You can probably avoid regex. How about:

msg = data.split('X-OriginalArrivalTime', 1)[1].split('\n', 1)[1]
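The one-liner above keeps only the message. If both halves are wanted, a minimal sketch along the same regex-free lines might look like this. The function name split_at_marker and the sample text are made up for illustration, and the sketch assumes the marker appears exactly once, which the question suggests but does not guarantee:

```python
def split_at_marker(data, marker="X-OriginalArrivalTime"):
    """Return (header, message), splitting right after the line containing marker."""
    start = data.index(marker)         # raises ValueError if the marker is absent
    line_end = data.find("\n", start)  # end of the marker line
    if line_end == -1:                 # marker is on the last line, so no message
        return data, ""
    return data[:line_end + 1], data[line_end + 1:]


# Hypothetical sample standing in for the contents of an EML file.
sample = "From: a@example.com\nX-OriginalArrivalTime: 1\nhello\nworld\n"
header, message = split_at_marker(sample)
```

Note that this looks for the marker anywhere in the text, not only at the start of a line, so a message body that quotes the header name would confuse it.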


瑾夏年华 2024-12-28 04:52:08

The re module is not very good at counting lines. What's more, you probably don't need it to check for the contents of the start of a line. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header, and the message.

def process_eml(filename):
    with open(filename) as fp:
        lines = fp.readlines()

    # Find the index of the last header line.
    for i, line in enumerate(lines):
        if line.startswith("X-OriginalArrivalTime"):
            break
    else:
        raise Exception("End of header not found")

    # readlines() keeps each line's trailing newline, so join with ''.
    header = ''.join(lines[:i+1])
    message = ''.join(lines[i+1:])  # Message starts at i + 1

    return header, message


深白境迁sunset 2024-12-28 04:52:08

After

import re

match = re.search(r'(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)$',
                  open('foo.eml').read(),
                  re.DOTALL | re.MULTILINE)

match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines, and re.MULTILINE lets ^ match at the start of each line.
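To try this regex without a file on disk, here is a small sketch that runs it against an in-memory string. The sample text is invented for illustration. It uses match.group(n), which returns a single captured group, whereas match.groups() returns a tuple of all groups:

```python
import re

# Hypothetical sample standing in for the contents of an EML file.
sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-OriginalArrivalTime: 01 Jan 2011 00:00:00.0000 (UTC)\n"
    "\n"
    "First body line\n"
    "Second body line\n"
)

pattern = re.compile(r"(.*^X-OriginalArrivalTime[^\n]*\n+)(.*)",
                     re.DOTALL | re.MULTILINE)

match = pattern.search(sample)
if match is None:
    raise ValueError("End of header not found")

headers = match.group(1)  # everything through the marker line and blank line(s)
body = match.group(2)     # the rest of the text
```

Because \n+ is greedy, any blank lines right after the marker line end up in the header part rather than the body.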


自找没趣 2024-12-28 04:52:08

I am not sure if it works with EML files, but Python has a module for working with email files (the standard-library email package).
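As a rough sketch of what that could look like, the standard-library email package can parse the text and hand back headers and body separately. The sample message below is invented, and the sketch assumes a simple, non-multipart message:

```python
import email

# Hypothetical sample standing in for the contents of an EML file.
raw = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-OriginalArrivalTime: 01 Jan 2011 00:00:00.0000 (UTC)\n"
    "\n"
    "First body line\n"
)

msg = email.message_from_string(raw)

headers = msg.items()     # list of (name, value) pairs, in original order
subject = msg["Subject"]  # individual headers can be looked up by name
body = msg.get_payload()  # a str here; for multipart mail this is a list of parts
```

For a file on disk, email.message_from_file(fp) takes an open file object instead of a string. This way the split does not depend on X-OriginalArrivalTime being the last header.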

If that does not work, aren't the headers separated from the message by an empty line?

with open('foo.eml') as fp:  # 'foo.eml' is a placeholder filename
    lines = fp.readlines()
header_end = lines.index('\n')  # first empty line, which I think marks the end of the headers
headers = lines[:header_end]
message = lines[header_end + 1:]  # skip the empty separator line


亣腦蒛氧 2024-12-28 04:52:08

It would indeed be nice to avoid a regex, but since you want to put the header and the message into two different strings, I think that split(), which discards the separator it splits on, and partition(), which returns a 3-element tuple, do not fit the purpose, so a regex is still worthwhile:

import re

# Group 1: everything through the marker line and its line break(s).
# Group 2: the rest of the text (the message).
# Note: this sample uses a literal '.' after the header name; real headers use ':'.
regx = re.compile(r'(.+?X-OriginalArrivalTime\.[^\n]*[\r\n]+)'
                  r'(.+)\Z',
                  re.DOTALL)

ss = ('blahblah blah\r\n'
      'totoro tootrototo \r\n'
      'erteruuty\r\n'
      'X-OriginalArrivalTime. 12h58 Huntington Point\r\n'
      'body begins here\r\n'
      'sdkjhqsdlfkghqdlfghqdfg\r\n'
      '23135468796786876544\r\n'
      'ldkshfqskdjf end of file\r\n')


header, message = regx.match(ss).groups()

print('header :')
print(repr(header))
print()
print('message :')
print(repr(message))

result

header :
'blahblah blah\r\ntotoro tootrototo \r\nerteruuty\r\nX-OriginalArrivalTime. 12h58 Huntington Point\r\n'

message :
'body begins here\r\nsdkjhqsdlfkghqdlfghqdfg\r\n23135468796786876544\r\nldkshfqskdjf end of file\r\n'
