Parsing email headers with regex in C#

Posted on 2024-11-03 23:15:59

I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.

Here is the source text:

Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <[email protected]>
To: <[email protected]>, [email protected], [email protected]
Cc: <[email protected]>, [email protected]
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]

I'm looking to pull out the following:

<[email protected]>, [email protected], [email protected]

I've been struggling with Regex all day without any luck.

Comments (5)

衣神在巴黎 2024-11-10 23:15:59

Contrary to some of the posts here, I have to agree with mmutz: you cannot parse email addresses with a regex. See RFC 2822, section 3.4.1:

https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1

3.4.1. Addr-spec specification

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.

The idea of "locally interpreted" means that only the receiving server is expected to be able to parse it.

If I were going to try and solve this I would find the "To" line contents, break it apart and attempt to parse each segment with System.Net.Mail.MailAddress.

    static void Main()
    {
        string input = @"Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: ""Lastname, Firstname"" <[email protected]>
To: <[email protected]>, ""Yes, this is valid""@[emails are hard to parse!], [email protected], [email protected]
Cc: <[email protected]>, [email protected]
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]";

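        // Capture everything after "To:" on its own line (case-insensitive, multiline).
        Regex toline = new Regex(@"^To\s*:\s*(?<to>.*)$", RegexOptions.IgnoreCase | RegexOptions.Multiline);
        string to = toline.Match(input).Groups["to"].Value;

        int from = 0;   // where the next comma search starts
        int pos = 0;    // start of the segment currently being tested
        int found;
        string test;

        while (from < to.Length)
        {
            // IndexOf returns -1 when there is no further comma.
            found = to.IndexOf(',', from);
            if (found < 0) found = to.Length;
            from = found + 1;
            test = to.Substring(pos, found - pos).Trim();

            if (test.Length == 0)
            {
                // Empty segment (doubled or trailing comma): MailAddress would
                // throw ArgumentException, so skip it.
                pos = found + 1;
                continue;
            }

            try
            {
                System.Net.Mail.MailAddress addy = new System.Net.Mail.MailAddress(test);
                Console.WriteLine(addy.Address);
                pos = found + 1;
            }
            catch (FormatException)
            {
                // Not a valid address yet: the comma may sit inside a quoted
                // local part, so keep pos and extend to the next comma.
            }
        }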
    }

Output from the above program:

[email protected]
"Yes, this is valid"@[emails are hard to parse!]
[email protected]
[email protected]
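
As a variation on the same idea, the splitting can be left to the framework: `MailAddressCollection.Add(string)` accepts a comma-separated list of addresses. The following is only a sketch under assumptions: the class name `ToHeaderSketch`, the method `ExtractToAddresses`, and the `example.com` sample addresses are made up for illustration, and it assumes the `To` field sits on a single, unfolded line.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;
using System.Text.RegularExpressions;

static class ToHeaderSketch
{
    // Returns the bare addresses from the first "To:" line of a header block.
    // Hypothetical helper, not part of the original answer.
    public static List<string> ExtractToAddresses(string header)
    {
        var result = new List<string>();
        Match m = Regex.Match(header, @"^To\s*:\s*(?<to>.*)$",
                              RegexOptions.IgnoreCase | RegexOptions.Multiline);
        string to = m.Groups["to"].Value.Trim();
        if (to.Length == 0) return result; // no To line, or an empty one

        var addresses = new MailAddressCollection();
        addresses.Add(to); // throws FormatException on input it cannot parse
        foreach (MailAddress a in addresses) result.Add(a.Address);
        return result;
    }

    static void Main()
    {
        // Sample data with placeholder addresses (assumption).
        string header = "From: \"Lastname, Firstname\" <sender@example.com>\n" +
                        "To: <a@example.com>, b@example.com, c@example.com\n" +
                        "Cc: <d@example.com>\n";
        foreach (string address in ExtractToAddresses(header))
            Console.WriteLine(address);
    }
}
```

Which unusual address forms are accepted depends on the `System.Net.Mail` version in use, so it is worth testing against real webhook payloads.
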
月朦胧 2024-11-10 23:15:59

The RFC 2822-compliant email regex is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Just run it over your text with case-insensitive matching (the pattern only lists lowercase letters) and you'll get the email addresses.

Of course, there's always the option of not using regex where regex isn't the best option. But up to you!
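
A minimal sketch of "running it over the text" in C#. To keep the listing short it assumes a simplified pattern (unquoted local part, dotted domain name) rather than the full expression above; the class name `AddressScanSketch`, the method `FindAddresses`, and the `example.com` addresses are illustrative only. Note that a scan like this returns every address in the text, including those on the `From` and `Cc` lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AddressScanSketch
{
    // Simplified address pattern (assumption): no quoted local parts,
    // no IP-literal domains.
    const string Pattern =
        @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*" +
        @"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";

    // Returns every substring of text that matches the pattern, in order.
    public static List<string> FindAddresses(string text)
    {
        var found = new List<string>();
        // IgnoreCase is needed because the pattern only lists a-z.
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
            found.Add(m.Value);
        return found;
    }

    static void Main()
    {
        string header = "From: \"Lastname, Firstname\" <sender@example.com>\n" +
                        "To: <a@example.com>, b@example.com\n";
        foreach (string address in FindAddresses(header))
            Console.WriteLine(address);
    }
}
```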

西瓜 2024-11-10 23:15:59

You cannot use regular expressions to parse RFC 2822 messages, because the grammar contains a recursive production (off the top of my head, it is the one for comments, which can nest: (a (nested) comment)), and that makes the grammar non-regular. Regular expressions, as the name suggests, can only parse regular grammars.

See also "RegEx match open tags except XHTML self-contained tags" for more information.
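
To illustrate the nesting point: handling nested comments takes a depth counter, which is exactly the kind of memory a classical regular expression lacks. The sketch below is an illustration only; `CommentSketch` and `StripComments` are hypothetical names, and it assumes the input contains no quoted strings or backslash-escaped parentheses.

```csharp
using System;
using System.Text;

static class CommentSketch
{
    // Removes parenthesised comments, which may nest, by tracking depth.
    // Assumption: no quoted strings and no backslash-escaped parentheses.
    public static string StripComments(string value)
    {
        var sb = new StringBuilder();
        int depth = 0; // number of comments currently open
        foreach (char c in value)
        {
            if (c == '(') depth++;
            else if (c == ')' && depth > 0) depth--;
            else if (depth == 0) sb.Append(c); // keep text outside comments
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripComments("a@example.com (work (main) box)").Trim());
    }
}
```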

二智少女 2024-11-10 23:15:59

As Blindy suggests, sometimes you can just parse it out the old-fashioned way.

If you prefer to do that, here is a quick approach assuming the email header text is called 'header':

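int start = header.IndexOf("To: ") + "To: ".Length;  // skip past the field name
int end = header.IndexOf("Cc: ", start);             // assumes a Cc: line follows the To: line
string x = header.Substring(start, end - start).Trim();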

I may be off by a byte on the subtraction but you can very easily test and modify this. Of course you will also have to be certain you always will have a Cc: row in your header or this won't work.
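
If depending on a `Cc:` line is a concern, a line-based variant can read the field by name instead. This is a sketch under assumptions: `HeaderFieldSketch` and `GetField` are made-up names, and the only header rule it relies on is that continuation lines of a folded field start with a space or tab.

```csharp
using System;
using System.Text;

static class HeaderFieldSketch
{
    // Returns the unfolded value of the first header field with the given
    // name, or null when the field is absent. Hypothetical helper.
    public static string GetField(string header, string name)
    {
        string[] lines = header.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        string prefix = name + ":";
        for (int i = 0; i < lines.Length; i++)
        {
            if (!lines[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase)) continue;

            var value = new StringBuilder(lines[i].Substring(prefix.Length));
            // Continuation lines of a folded field start with a space or tab.
            while (i + 1 < lines.Length && lines[i + 1].Length > 0 &&
                   (lines[i + 1][0] == ' ' || lines[i + 1][0] == '\t'))
            {
                value.Append(lines[++i]);
            }
            return value.ToString().Trim();
        }
        return null;
    }

    static void Main()
    {
        string header = "From: sender@example.com\n" +
                        "To: <a@example.com>,\n b@example.com\n" +
                        "X-Other: 1\n";
        Console.WriteLine(GetField(header, "To"));
    }
}
```

Because the match is anchored at the start of a line, a field such as `Reply-To:` is not mistaken for `To:`.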

梦里人 2024-11-10 23:15:59

There's a breakdown of validating emails with regex here, which references a more practical implementation of RFC 2822 with:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

It also looks like you only want the email addresses out of the "To" field, and you've got the <> to worry about as well, so something like the following would likely work:

^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)

Again, as others have mentioned, you might not want to do this. But if you want a regex that will turn that input into <[email protected]>, [email protected], [email protected], that'll do it.
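
A sketch of how the `^To:` pattern might be applied in C#. The names `ToLineRegexSketch` and `ExtractToList` and the `example.com` addresses are illustrative assumptions. The pattern needs `RegexOptions.Multiline` so that `^` matches at the start of the `To` line, and `RegexOptions.IgnoreCase` because it only lists lowercase letters; the capture is trimmed because the trailing `\s*` also swallows the line break.

```csharp
using System;
using System.Text.RegularExpressions;

static class ToLineRegexSketch
{
    // The simplified address pattern, split in two for readability.
    const string Address =
        @"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*" +
        @"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?";

    // Same shape as the "^To: (...)" pattern: optional angle brackets around
    // each address, optional comma, optional whitespace, repeated.
    static readonly Regex ToLine = new Regex(
        @"^To: ((?:<?" + Address + @">?,?\s*)*)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the address list of the To line, or "" when there is none.
    public static string ExtractToList(string header)
    {
        Match m = ToLine.Match(header);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    static void Main()
    {
        string header = "From: sender@example.com\n" +
                        "To: <a@example.com>, b@example.com, c@example.com\n" +
                        "Cc: d@example.com\n";
        Console.WriteLine(ExtractToList(header));
    }
}
```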
