Python - Search a string, copy until end of document
I am using Python to open EML files one at a time, process them, and then move them to another folder. Each EML file contains an email message, including the headers.
The first 35-40 lines of the EML are header info, followed by the actual email message. Since the number of header lines varies, I can't just convert my EML file to a list and tell it:
print(emllist[37:])
However, the last line of the headers always starts the same way: it begins with X-OriginalArrivalTime.
My goal is to parse my EML file, find the line number that X-OriginalArrivalTime is on, and then split the EML into two strings, one containing the header info and one containing the message.
I have been rereading the Python re documentation, but I can't seem to come up with a good way to attack this.
Any help is greatly appreciated.
Thanks,
Lou
Comments (5)
You can probably avoid regex. How about:
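The code that accompanied this answer is not preserved here. As a minimal sketch of a regex-free approach, assuming the EML content is already in a string and that the last header line starts with X-OriginalArrivalTime as the question states, something like the following could work. The name split_at_marker is hypothetical and chosen only for illustration.

```python
MARKER = "X-OriginalArrivalTime"


def split_at_marker(eml_text, marker=MARKER):
    """Return (headers, message), splitting right after the marker line.

    Hypothetical helper: assumes the last header line starts with `marker`.
    """
    offset = 0
    # keepends=True keeps the line endings, so the running offset stays
    # aligned with positions in the original string.
    for line in eml_text.splitlines(keepends=True):
        offset += len(line)
        if line.startswith(marker):
            return eml_text[:offset], eml_text[offset:]
    raise ValueError("no line starts with " + marker)


sample = "From: a@b.c\nX-OriginalArrivalTime: 01 Jan 2010\n\nHello\nWorld\n"
headers, message = split_at_marker(sample)
```

Because the split is done by slicing, headers + message reproduces the original text exactly, including the blank line that separates the two parts.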
The re module is not very good at counting lines. What's more, you probably don't need it to check the contents of the start of a line. The following function takes the filename of the EML file as input and returns a tuple containing two strings: the header and the message.
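The function from this answer is not preserved here. A minimal sketch of a function with the described behavior (filename in, a tuple of header and message out) might look like the following. The name split_eml, the marker parameter, and the errors="replace" choice are assumptions made for illustration, not the answerer's code.

```python
def split_eml(filename, marker="X-OriginalArrivalTime"):
    """Read an EML file and return (header, message) as two strings.

    Hypothetical sketch: everything up to and including the first line
    that starts with `marker` is treated as the header.
    """
    header_lines = []
    message_lines = []
    in_header = True
    # errors="replace" is an assumption so odd bytes do not abort the read.
    with open(filename, "r", errors="replace") as eml_file:
        for line in eml_file:
            if in_header:
                header_lines.append(line)
                if line.startswith(marker):
                    in_header = False  # the marker line ends the header
            else:
                message_lines.append(line)
    # If the marker never appears, the whole file ends up in the header
    # and the message is an empty string.
    return "".join(header_lines), "".join(message_lines)
```

A call such as header, message = split_eml("mail.eml") would then give the two strings; "mail.eml" is a placeholder filename.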
After a successful match, match.group(1) should contain the headers and match.group(2) the email message's body. The re.DOTALL flag causes . to match newlines.
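The regular expression from this answer is not preserved here. One pattern that behaves as described, with two groups and re.DOTALL, is sketched below. The pattern itself, the added re.MULTILINE flag, and the name split_with_regex are assumptions for illustration.

```python
import re

# Group 1: everything up to and including the X-OriginalArrivalTime line.
# Group 2: everything after it.
# re.DOTALL lets "." match newlines; re.MULTILINE lets "^" match at the
# start of any line, not only at the start of the string.
PATTERN = re.compile(
    r"(.*?^X-OriginalArrivalTime[^\n]*\n?)(.*)",
    re.DOTALL | re.MULTILINE,
)


def split_with_regex(eml_text):
    """Return (headers, body) using the two capture groups (hypothetical helper)."""
    match = PATTERN.match(eml_text)
    if match is None:
        raise ValueError("X-OriginalArrivalTime line not found")
    return match.group(1), match.group(2)


sample = "From: a@b.c\nX-OriginalArrivalTime: 01 Jan 2010\n\nHello\nWorld\n"
headers, body = split_with_regex(sample)
```

Note that match.group(n) returns a single group, whereas match.groups() returns all groups as a tuple.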
I am not sure if it works with EML files, but Python has a module for working with email files: the email package in the standard library.
If that does not work, isn't it true that the headers are separated from the message by an empty line?
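Both suggestions can be sketched briefly. The example below assumes a simple, non-multipart message held in a string. The helper names are hypothetical, and the blank-line variant assumes "\n" line endings (files with "\r\n" endings would need "\r\n\r\n" as the separator).

```python
from email import message_from_string


def headers_and_body(eml_text):
    """Parse with the standard email package (sketch for non-multipart mail)."""
    msg = message_from_string(eml_text)
    # msg.items() yields (name, value) pairs for every header field.
    headers = "".join("%s: %s\n" % (name, value) for name, value in msg.items())
    # For a non-multipart message, get_payload() returns the body as a string.
    body = msg.get_payload()
    return headers, body


def split_on_blank_line(eml_text):
    """Split at the first empty line, which separates headers from body."""
    headers, _, body = eml_text.partition("\n\n")
    return headers, body


sample = "From: a@b.c\nX-OriginalArrivalTime: 01 Jan 2010\n\nHello\nWorld\n"
parsed_headers, parsed_body = headers_and_body(sample)
raw_headers, raw_body = split_on_blank_line(sample)
```

The email package has the advantage of not depending on any particular header being last; for multipart messages get_payload() returns a list of parts instead of a string, so that case would need extra handling.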
It is true that avoiding a regex would be attractive. However, since you want to put the headers and the message into two separate strings, I don't think split() or partition() fits the purpose: split() discards the separator on which the split is made, and partition() returns a tuple of three elements. So a regex is still worth considering.
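The code and output from this answer are not preserved here. The behavior it describes can be checked on a small sample, shown below together with one possible regex-based split that keeps every character. The sample text and the pattern are assumptions for illustration, not the answerer's original code.

```python
import re

sample = "From: a@b.c\nX-OriginalArrivalTime: 01 Jan 2010\n\nHello\nWorld\n"

# split() drops the separator, so the two pieces no longer add up
# to the original text.
parts_split = sample.split("\n\n", 1)

# partition() keeps the separator, but as a third, middle element.
parts_partition = sample.partition("\n\n")

# A regex search locates the end of the marker line; slicing at that
# position yields exactly two strings and loses nothing.
found = re.search(r"^X-OriginalArrivalTime.*\n", sample, re.MULTILINE)
headers, message = sample[:found.end()], sample[found.end():]
```

Here headers ends with the X-OriginalArrivalTime line and message starts with the blank separator line, so concatenating the two gives back the original text.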