正则表达式的替代(流畅?)界面设计

发布于 2024-08-08 14:26:53 字数 11494 浏览 6 评论 0 原文

我刚刚看到了一个巨大的 Java 正则表达式,这让我对正则表达式的一般可维护性进行了一些思考。我相信大多数人——除了一些糟糕的 Perl 贩子——都会同意正则表达式很难维护。

我正在考虑如何解决这种情况。到目前为止,我最有希望的想法是使用流畅的界面。举个例子,而不是:

Pattern pattern = Pattern.compile("a*|b{2,5}");

人们可以写这样的东西

import static util.PatternBuilder.*

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

Pattern alternative = 
  or(
    string("a").anyTimes(),
    string("b").times(2,5)
  )
  .compile();

在这个非常简短的例子中,创建正则表达式的常见方法对于任何平庸的天才开发人员来说仍然很容易阅读。然而,想想那些充满两行或更多行、每行 80 个字符的怪异表达式。当然,(详细的)流畅的界面需要几行而不是两行,但我确信它会更具可读性(因此可维护)。

现在我的问题是:

  1. 你知道任何类似的正则表达式方法吗?

  2. 您是否同意这种方法比使用简单字符串更好?

  3. 您将如何设计 API?

  4. 您会在您的项目中使用如此简洁的实用程序吗?

  5. 您认为实施起来有趣吗? ;)

编辑: 想象一下,可能存在比我们在正则表达式中都没有的简单构造更高级别的方法,例如

// matches [email protected] - think of it as reusable expressions
Pattern p = string{"a").anyTimes().string("b@").domain().compile();

编辑 - 评论简短摘要:

有趣的是,大多数人认为正则表达式将继续存在——尽管需要工具来阅读它们,并且需要聪明的人来想办法使它们可维护。虽然我不确定流畅的界面是最好的方法,但我确信一些聪明的工程师 - 我们? ;) - 应该花一些时间让正则表达式成为过去 - 它们已经陪伴我们 50 年了,知道这一点就足够了,你不觉得吗?

开放赏金

赏金将颁发给正则表达式新方法的最佳创意(无需代码)。

编辑 - 一个很好的例子:

这是我正在谈论的那种模式 - 对第一个能够翻译它的人给予额外的荣誉 - RegexBuddies 允许(顺便说一句,它来自 Apache 项目)< /strike> 额外荣誉授予 chiimez:这是一种符合 RFC 的电子邮件地址验证模式 - 尽管其 RFC822(参见 ex-parrot.com< /a>),而不是 5322 - 但不确定是否有区别 - 如果是的,我会为适合 5322 的补丁授予下一个额外的荣誉;)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

I've just seen a huge regex for Java that made me think a little about maintainability of regular expressions in general. I believe that most people - except some badass perl mongers - would agree that regular expressions are hardly maintainable.

I was thinking about how this situation could be fixed. So far, the most promising idea I have is using a fluent interface. To give an example, instead of:

Pattern pattern = Pattern.compile("a*|b{2,5}");

one could write something like this

import static util.PatternBuilder.*

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

Pattern alternative = 
  or(
    string("a").anyTimes(),
    string("b").times(2,5)
  )
  .compile();

In this very short example, the common way of creating regular expression is still quite readable to any mediocre talented developer. However, think about those spooky expressions that fill two or more lines with 80 characters each. Sure, the (verbose) fluent interface would require several lines instead of only two, but I'm sure it would be much more readable (hence maintainable).

Now my questions:

  1. Do you know of any similar approach to regular expressions?

  2. Do you agree that this approach could be better than using simple strings?

  3. How would you design the API?

  4. Would you use such a neat utility in your projects?

  5. Do you think this would be fun implementing? ;)

EDIT:
Imagine that there could be methods that are on a higher level than simple constructs we all no from regex, e.g.

// matches [email protected] - think of it as reusable expressions
Pattern p = string{"a").anyTimes().string("b@").domain().compile();

EDIT - short summary of comments:

It's interesting to read that most people think that regular expressions are here to stay - although it takes tools to read them and smart guys to think of ways to make them maintainable. While I'm not sure that a fluent interface is the best way to go, I'm sure that some smart engineers - we? ;) - should spend some time to make regular expressions a thing of the past - it's enough that they've been with us for 50 years know, don't you think?

OPEN BOUNTY

The bounty will be awarded to the best idea (no code required) for a new approach to regular expressions.

EDIT - A NICE EXAMPLE:

here's the kind of pattern I'm talking about - extra kudos to the first one who's able to translate it - RegexBuddies allowed (it's from an Apache project btw) extra kudos awarded to chii and mez: it's a RFC compliant email address validation pattern - although its RFC822 (see ex-parrot.com), not 5322 - not sure if there is a difference though - if yes, I'll award next extra kudos for a patch to fit 5322 ;)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(19

迷迭香的记忆 2024-08-15 14:26:54

Martin Fowler 建议另一种策略。即获取正则表达式中有意义的部分并用变量替换它们。他使用了以下示例:

 "^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)" 

变得

   String scoreKeyword = "^score\s+";
   String numberOfPoints = "(\d+)";
   String forKeyword = "\s+for\s+";
   String numberOfNights = "(\d+)";
   String nightsAtKeyword = "\s+nights?\s+at\s+";
   String hotelName = "(.*)";

   String pattern = scoreKeyword + numberOfPoints +
      forKeyword + numberOfNights + nightsAtKeyword + hotelName;

更具可读性和可维护性。

上一个示例的目标是从字符串列表中解析 numberOfPoints、numberOfNights 和 hotelName,如下所示:

score 400 for 2 nights at Minas Tirith Airport

Martin Fowler suggest another strategy. Namely taking meaningful parts of the regex and replace them by variables. He uses following example:

 "^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)" 

becomes

   String scoreKeyword = "^score\s+";
   String numberOfPoints = "(\d+)";
   String forKeyword = "\s+for\s+";
   String numberOfNights = "(\d+)";
   String nightsAtKeyword = "\s+nights?\s+at\s+";
   String hotelName = "(.*)";

   String pattern = scoreKeyword + numberOfPoints +
      forKeyword + numberOfNights + nightsAtKeyword + hotelName;

Which is much more readable and maintainable.

The goal of the previous example was to parse numberOfPoints, numberOfNights and hotelName out of a List of Strings like:

score 400 for 2 nights at Minas Tirith Airport
妞丶爷亲个 2024-08-15 14:26:54

对于没有任何正则表达式经验的人来说可能会稍微容易一些,但是在有人学习了您的系统之后,他仍然无法在其他地方读取正常的正则表达式。

另外,我认为您的版本对于正则表达式专家来说更难阅读。

我建议像这样注释正则表达式:

Pattern pattern = Pattern.compile(
  "a*     # Find 0 or more a        \n" +
  "|      # ... or ...              \n" +
  "b{2,5} # Find between 2 and 5 b  \n",
Pattern.COMMENTS);

这对于任何经验水平的人来说都很容易阅读,并且对于没有经验的人来说,它同时教授正则表达式。此外,注释可以根据具体情况进行定制,以解释正则表达式背后的业务规则,而不仅仅是结构。

此外,像 RegexBuddy 这样的工具可以获取您的正则表达式并将其转换为:

Match either the regular expression below (attempting the next alternative only if this one fails) «a*»
   Match the character "a" literally «a*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «b{2,5}»
   Match the character "b" literally «b{2,5}»
      Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»

It might be slightly easier for someone without any regex experience, but after someone learns your system, he still won't be able to read a normal regex elsewhere.

Also, I think your version is harder to read for a regex expert.

I would recommend annotating the regex like this:

Pattern pattern = Pattern.compile(
  "a*     # Find 0 or more a        \n" +
  "|      # ... or ...              \n" +
  "b{2,5} # Find between 2 and 5 b  \n",
Pattern.COMMENTS);

This is easy to read for any experience level, and for the inexperienced, it teaches regex at the same time. Also, the comments can be tailored to the situation to explain the business rules behind the regex rather than just the structure.

Also, a tool like RegexBuddy can take your regex and translate it into:

Match either the regular expression below (attempting the next alternative only if this one fails) «a*»
   Match the character "a" literally «a*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «b{2,5}»
   Match the character "b" literally «b{2,5}»
      Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
时光暖心i 2024-08-15 14:26:54

这是一个有趣的概念,但它的提出存在一些缺陷。

但首先回答所提出的关键问题:

现在我的问题:

1。您知道任何类似的正则表达式方法吗?

没有一个没有被提及过。以及我通过阅读问题和答案发现的内容。

2.您是否同意这种方法比使用简单字符串更好?

如果它像宣传的那样工作,它肯定会使事情更容易调试。

3.您将如何设计 API?

请参阅下一节中我的注释。我以您的示例和链接的 .NET 库作为起点。

4.您会在您的项目中使用如此简洁的实用程序吗?

尚未决定。我使用当前版本的神秘正则表达式没有任何问题。我需要一个工具来将现有的正则表达式转换为流畅的语言版本。

5.您认为实施起来会很有趣吗? ;)

与编写实际代码相比,我更喜欢研究更高层次的方法。这解释了这个答案的文字墙。


以下是我注意到的几个问题以及我处理这些问题的方法。

结构不清晰。

您的示例似乎通过连接字符串来创建正则表达式。这不是很稳健。我认为这些方法应该添加到 String 和 Patern/Regex 对象中,因为这将使实现和代码更清晰。除此之外,它类似于正则表达式的经典定义方式。

仅仅因为我看不到它以任何其他方式工作,我对提议方案的其余注释将假设所有方法都作用于并返回 Pattern 对象。

编辑:我似乎自始至终都使用了以下约定。所以我已经澄清了它们并将它们移到了这里。

  • 实例方法:模式增强。例如:捕获、重复、查看断言。

  • 运算符:操作顺序。交替、串联

  • 常量:字符类、边界(代替 \w、$、\b 等)

交替 捕获/聚类如何处理?

捕获是正则表达式的重要组成部分。

我看到每个 Pattern 对象在内部存储为一个集群。 Perl 术语中的 (?: 模式)。允许图案标记轻松混合和混合,而不会干扰其他部分。

我希望捕获作为模式上的实例方法来完成。使用一个变量来存储匹配的字符串。

pattern.capture(variable) 会将模式存储在变量中。如果捕获是要多次匹配的表达式的一部分,则变量应包含模式所有匹配的字符串数组。

流利的语言可能非常含糊。

流利语言不太适合正则表达式的递归性质。因此需要考虑操作顺序。仅将方法链接在一起并不允许使用非常复杂的正则表达式。这种工具正是在这种情况下有用的。

是否

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

产生 /a*|b{2,5}//(a*|b){2,5}/

这样的方案将如何处理嵌套交替?例如: /a*|b(c|d)|e/

我看到了在正则表达式中处理交替的三种方法

  1. 作为运算符: pattern1 或 pattern2 => pattern # /pattern1|pattern2/
  2. 作为类方法: Pattern.or(pattern1,pattern2[,pattern3]*) =>; pattern # /pattern1|pattern2|pattern3|...|/
  3. 作为实例方法:pattern1.or(pattern2) => pattern # /pattern1|pattern2/

我会以同样的方式处理串联。

  1. 作为运算符:pattern1 + pattern2 => pattern # /pattern1pattern2/
  2. 作为类方法: Pattern.concatenate(pattern1,pattern2[,pattern3]*) =>; pattern # /pattern1pattern2pattern3.../
  3. 作为实例方法:pattern1.then(pattern2) => pattern # /pattern1pattern2/

如何扩展常用模式

建议的方案使用 .domain() 这似乎是一个常见的正则表达式。将用户定义的模式视为方法并不容易添加新模式。在像 Java 这样的语言中,库的用户必须重写该类才能为常用模式添加方法。

我的建议是解决这个问题,将每件作品都视为一个对象。可以为每个常用的正则表达式创建一个模式对象,例如匹配域。鉴于我之前关于捕获的想法,确保捕获适用于包含捕获部分的同一常见模式的多个副本并不难。

还应该有与各种字符类匹配的模式常量。

零宽度环视断言

扩展了我的想法,即所有片段都应该隐式聚集。使用实例方法环顾断言应该不会太难。

pattern.zeroWidthLookBehind() 会生成 (?


仍然需要考虑的事情。

  • 命名捕获不会太难
  • 反向引用:希望前面讨论的如何实际实现它的 。我没有过多考虑内部结构。这才是真正的魔法发生的地方。
  • 翻译:确实应该有一个工具可以在经典正则表达式(比如 Perl 方言)和新方案之间进行翻译。从新方案进行翻译可能是包的一部分

将其全部放在一起,我提出的与电子邮件地址匹配的模式版本:

Pattern domain_label = LETTER_CHARACTER + (LETTER_CHARACTER or "-" or DIGIT_CHARACTER).anyTimes()
Pattern domain = domain_label + ("." + domain_label).anyTimes()
Pattern pattern = (LETTER_CHARACTER + ALPHANUMERIC_CHARACTER + "@" + domain).compile

事后看来,我的方案大量借鉴了 Martin Fowler 正在使用的方法。尽管我并不打算让事情发展成这样,但这确实使使用这样的系统更易于维护。它还解决了福勒方法(捕获顺序)的一两个问题。

This is an intriguing concept, but as it is presented there are a few flaws.

But first answers to the key questions asked:

Now my questions:

1. Do you know of any similar approach to regular expressions?

None that haven't been mentioned already. And those I found out about by reading the question and answers.

2. Do you agree that this approach could be better than using simple strings?

If it works as advertised, it would definitely makes things much easier to debug.

3. How would you design the API?

See my notes in the next section. I take your examples and the linked .NET library as a starting point.

4. Would you use such a neat utility in your projects?

Undecided. I have no problem working with the current version of cryptic regularly expressions. And I would need a tool to convert existing regular expressions to the fluent language version.

5. Do you think this would be fun implementing? ;)

I'd enjoy working out the higher level how, than writing the actual code. Which explains the wall of text that is this answer.


Here are a couple of problems I noticed, and the way I would deal with it.

Unclear structure.

Your example seems to create a regex by concatenating on to strings. This is not very robust. I believe that these methods should be added to String and Patern/Regex objects, because it will make implementation and code cleaner. Besides it's similar to the way Regular Expressions are classically defined.

Just because I can't see it working any other way, the rest of my annotations to the proposed scheme will assume that all methods act on and return Pattern objects.

Edit: I seem to have used the following conventions throughout. So I've clarified them and move them up here.

  • Instance methods: pattern augmentation. eg: capturing, repetition, look around assertions.

  • Operators: order of operations. alternation, concatenation

  • Constants: character classes, boundaries (inplace of \w, $, \b, etc)

How will capturing/clustering be handled?

Capturing is a huge part of regular expressions.

I see each Pattern object being stored internally as a cluster. (?: pattern) in Perl terms. Allowing for pattern tokens to easily be mixed and mingled without interfering with other pieces.

I expect capturing to be done as an instance method on a Pattern. Taking a variable to store the matching string[s] in.

pattern.capture(variable) would store pattern in variable. In the event that the capture is part of an expression to be matched multiple times, variable should contain an array of strings of all matches for pattern.

Fluent languages can be very ambiguous.

Fluent languages are not well suited to the recursive nature of regular expressions. So consideration needs to be given to the order of operations. Just chaining methods together does not allow for very complex regular expressions. Exactly the situation where such a tool would be useful.

Does

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

produce /a*|b{2,5}/ or /(a*|b){2,5}/ ?

How would such a scheme handle nested alternation? Eg: /a*|b(c|d)|e/

I see three ways of handling alternation in regular expressions

  1. As an operator: pattern1 or pattern2 => pattern # /pattern1|pattern2/
  2. As a class method: Pattern.or( pattern1, pattern2[, pattern3]*) => pattern # /pattern1|patern2|patern3|...|/
  3. As a instance method: pattern1.or(pattern2) => pattern # /pattern1|patern2/

I would handle concatenation the same way.

  1. As an operator: pattern1 + pattern2 => pattern # /pattern1pattern2/
  2. As a class method: Pattern.concatenate( pattern1, pattern2[, pattern3]*) => pattern # /pattern1patern2patern3.../
  3. As a instance method: pattern1.then(pattern2) => pattern # /pattern1patern2/

How to extend for commonly used patterns

The proposed scheme used .domain() which seems to be a common regex. To treat user defined patterns as a methods doesn't make it easy to add new patterns. In a language like Java users of the library would have to override the class to add methods for commonly used patterns.

This is addressed by my suggestion to treat every piece as an object. A pattern object can be created for each commonly used regex like matching a domain for example. Given my previous thoughts about capturing it wouldn't too be hard to ensure that capturing works for multiple copies of the same common pattern that contains a captured section.

There should also be constants for patterns matching various character classes.

Zero width look around assertions

Expanding on my thoughts that all pieces should be implicitly clustered. Look around assertions shouldn't be too hard too do with an instance method.

pattern.zeroWidthLookBehind() would produce (?<patten).


Things that still need to be considered.

  • Backreferences: Hopefully not too hard with the named capturing discussed earlier
  • How to actually implement it. I haven't given the internals too much thought. It's where the real magic is going to happen.
  • Translation: There really should be a tool translate to and from classic regexs (say Perl dialect) and the new scheme. Translating from the new scheme could be a part of the package

Putting it all together, my proposed version of a pattern that matches an email address:

Pattern domain_label = LETTER_CHARACTER + (LETTER_CHARACTER or "-" or DIGIT_CHARACTER).anyTimes()
Pattern domain = domain_label + ("." + domain_label).anyTimes()
Pattern pattern = (LETTER_CHARACTER + ALPHANUMERIC_CHARACTER + "@" + domain).compile

In hindsight, my scheme borrows heavily from Martin Fowler's approach in use. Although I didn't intend for things to go this way, it definitely makes using such a system more maintainable. It also addresses a problem or two with Fowler's approach (capturing order).

旧时光的容颜 2024-08-15 14:26:54

您将如何设计 API?

我会借用 Hibernate criteria API 中的一个页面。而不是使用:

string("a").anyTimes().or().string("b").times(2,5).compile()

使用这样的模式:

Pattern.or(Pattern.anyTimes("a"), Pattern.times("b", 2, 5)).compile()

这种表示法更简洁一些,我觉得更容易理解模式的层次结构/结构。每个方法都可以接受字符串或模式片段作为第一个参数。

您知道任何类似的正则表达式方法吗?

不是即兴的,不是。

您是否同意这种方法比使用简单字符串更好?

是的,绝对如此......如果您使用正则表达式来处理任何远程复杂的事情。对于非常短且简单的情况,字符串更方便。

您会在您的项目中使用这样一个简洁的实用程序吗?

有可能,因为它已经被证明/稳定了...将其转移到像 Apache Commons 这样的更大的实用程序项目中可能会更好。

您认为实施起来有趣吗? ;)

+1

How would you design the API?

I would borrow a page from the Hibernate criteria API. Instead of using:

string("a").anyTimes().or().string("b").times(2,5).compile()

Use a pattern like:

Pattern.or(Pattern.anyTimes("a"), Pattern.times("b", 2, 5)).compile()

This notation is a bit more concise, and I feel that it's easier to understand the hierarchy/structure of the pattern. Each method could accept either a string, or a pattern fragment, as the first argument.

Do you know of any similar approach to regular expressions?

Not offhand, no.

Do you agree that this approach could be better than using simple strings?

Yes, absolutetely... if you are using regex's for anything remotely complicated. For very short and simple cases, a string is more convenient.

Would you use such a neat utility in your projects?

Possibly, as it became proven/stable... rolling it into a larger utility project like Apache Commons might be a plus.

Do you think this would be fun implementing? ;)

+1

数理化全能战士 2024-08-15 14:26:54

我自己的尝试可以在 GitHub 上找到。虽然我认为它不值得用于简单的表达式,但除了可读性的改进之外,它确实提供了一些优点:

  • 它负责括号匹配
  • 它处理所有“特殊”字符的转义,这可能很快导致反斜杠

地狱几个简单的例子:

 // Matches a single digit
    RegExBuilder.build(anyDigit()); // "[0-9]"

 // Matches exactly 2 digits
    RegExBuilder.build(exactly(2).of(anyDigit())); // "[0-9]{2}"

 // Matches between 2 and 4 letters
    RegExBuilder.build(between(2,4).of(anyLetter())); // "[a-zA-Z]{2,4}"

还有一个更复杂的例子(或多或少验证电子邮件地址):

final Token ALPHA_NUM = anyOneOf(range('A','Z'), range('a','z'), range('0','9'));
final Token ALPHA_NUM_HYPEN_UNDERSCORE = anyOneOf(characters('_','-'), range('A','Z'), range('a','z'), range('0','9'));

String regexText = RegExBuilder.build(
 // Before the '@' symbol we can have letters, numbers, underscores and hyphens anywhere
    oneOrMore().of(
        ALPHA_NUM_HYPEN_UNDERSCORE
    ),
    zeroOrMore().of(
        text("."), // Periods are also allowed in the name, but not as the initial character
        oneOrMore().of(
            ALPHA_NUM_HYPEN_UNDERSCORE
        )
    ),
    text("@"),
 // Everything else is the domain name - only letters, numbers and periods here
    oneOrMore().of( 
        ALPHA_NUM
    ),
    zeroOrMore().of(
        text("."), // Periods must not be the first character in the domain
        oneOrMore().of(
            ALPHA_NUM
        )
    ),
    text("."), // At least one period is required
    atLeast(2).of( // Period must be followed by at least 2 letters (this is the TLD)
        anyLetter()
    )
);

My own humble attempt can be found on GitHub. Although I don't think it's worth using for simple expressions, it does provide a couple of advantages apart from the readability improvement:

  • It takes care of bracket matching
  • It handles escaping of all the 'special' characters which can quickly lead to backslash hell

A few simple examples:

 // Matches a single digit
    RegExBuilder.build(anyDigit()); // "[0-9]"

 // Matches exactly 2 digits
    RegExBuilder.build(exactly(2).of(anyDigit())); // "[0-9]{2}"

 // Matches between 2 and 4 letters
    RegExBuilder.build(between(2,4).of(anyLetter())); // "[a-zA-Z]{2,4}"

And a more complicated one (which more or less validates email addresses):

final Token ALPHA_NUM = anyOneOf(range('A','Z'), range('a','z'), range('0','9'));
final Token ALPHA_NUM_HYPEN_UNDERSCORE = anyOneOf(characters('_','-'), range('A','Z'), range('a','z'), range('0','9'));

String regexText = RegExBuilder.build(
 // Before the '@' symbol we can have letters, numbers, underscores and hyphens anywhere
    oneOrMore().of(
        ALPHA_NUM_HYPEN_UNDERSCORE
    ),
    zeroOrMore().of(
        text("."), // Periods are also allowed in the name, but not as the initial character
        oneOrMore().of(
            ALPHA_NUM_HYPEN_UNDERSCORE
        )
    ),
    text("@"),
 // Everything else is the domain name - only letters, numbers and periods here
    oneOrMore().of( 
        ALPHA_NUM
    ),
    zeroOrMore().of(
        text("."), // Periods must not be the first character in the domain
        oneOrMore().of(
            ALPHA_NUM
        )
    ),
    text("."), // At least one period is required
    atLeast(2).of( // Period must be followed by at least 2 letters (this is the TLD)
        anyLetter()
    )
);
我们只是彼此的过ke 2024-08-15 14:26:54

有一个适用于 .NET 的流畅的正则表达式库

There's a fluent regexps library for .NET.

素手挽清风 2024-08-15 14:26:54

简短的回答:我已经从检查和编译的角度看到了它,我认为这是需要考虑的事情。

长答案:
我工作的公司为企业内容过滤应用程序提供基于硬件的正则表达式引擎。考虑在网络路由器中以 20GB/秒的速度运行防病毒或防火墙应用程序,而不是占用宝贵的服务器或处理器周期。大多数防病毒、反垃圾邮件或防火墙应用程序的核心都是一堆正则表达式。

无论如何,正则表达式的编写方式对扫描性能有巨大影响。您可以用几种不同的方式编写正则表达式来完成同一件事,有些方式的性能会快得多。我们为客户编写了编译器和 linter,帮助他们维护和调整表达式。

回到OP的问题,我不会定义一个全新的语法,而是编写一个linter(抱歉,我们的语法是专有的),将正则表达式剪切并粘贴到其中,这将分解旧的正则表达式并输出“流利的英语”,以便某人更好地理解。我还会添加相对性能检查和常见修改建议。

Short answer: I have seen it approached from a linting and compiling angle, which I think is something to consider for this.

Long answer:
The company I work for makes hardware-based regular expression engines for enterprise content filtering applications. Think running anti-virus or firewall applications at 20GB/sec speeds right in the network routers rather than taking up valuable server or processor cycles. Most anti-virus, anti-spam or firewall apps are a bunch of regex expressions at the core.

Anyway, the way the regex's are written has a huge impact on the performance of the scanning. You can write regexs in several different ways to do the same thing, and some will have drastically faster performance. We've written compilers and linters for our customers to help them maintain and tune their expressions.

Back to the OP's question, rather than defining a whole new syntax, I would write a linter (sorry, ours is proprietary) cut and paste regex's into that will break down legacy regex's and output "fluent english" for someone to understand better. I'd also add relative performance checks and suggestions for common modifications.

唐婉 2024-08-15 14:26:54

对我来说,简短的答案是,一旦您遇到足够长的正则表达式(或执行相同操作的其他模式匹配),就会导致问题......您可能应该考虑它们是否是正确的工具首先是为了这份工作。

老实说,任何流畅的界面似乎都比标准正则表达式更难阅读。对于非常短的表达,流利的版本是冗长的,但不会太长;它是可读的。但那么长的东西的正则表达式也是如此。

对于中等大小的正则表达式,流畅的界面变得笨拙;足够长,即使不是不可能,也很难阅读。

对于很长的正则表达式(即电子邮件地址),正则表达式实际上很难(如果不是不可能)阅读,流畅的版本在 10 页前就变得无法阅读。

The short answer, for me, is that, once you get to regular expressions (or other pattern matching that does the same thing) that are long enough to cause a problem... you should probably be considering if they're the right tool for the job in the first place.

Honestly, any fluent interface seems like it would be harder to read than a standard regular expression. For really short expressions, the fluent version is verbose, but not too long; it's readable. But so is the regular expression for something that long.

For a medium sized regular expression, a fluent interface becomes unwieldy; long enough that it's hard, if not impossible, to read.

For a long regular expression (ie, the email address one), where the regular expression is actually hard (if not impossible) to read, the fluent version became impossible to read 10 pages ago.

梦里°也失望 2024-08-15 14:26:54

我最近有同样的想法

想自己实现它,但后来我发现了 VerbalExpressions

I recently had this same idea.

Thought of implementing it myself, but then I found VerbalExpressions.

我的影子我的梦 2024-08-15 14:26:54

您知道任何类似的正则表达式方法吗?

不知道,除了之前的答案

您是否同意这种方法比使用简单字符串更好?

有点 - 我我认为我们可以使用更具描述性的标记,而不是用单个字符来表示结构,但我怀疑这会让复杂的逻辑变得更清晰。

您将如何设计 API?

将每个构造转换为方法名称,并允许嵌套函数调用,以便很容易获取字符串并将方法名称替换为其中。

我认为大部分价值在于定义一个强大的实用函数库,例如匹配可以自定义的“电子邮件”、“电话号码”、“不包含 X 的行”等。

你会在你的项目中使用这样一个简洁的实用程序吗?

也许 - 但可能只适用于较长的项目,在这些项目中,调试函数调用比调试字符串编辑更容易,或者有一个很好的实用程序函数准备使用。

您认为实施起来有趣吗? ;)

当然!

Do you know of any similar approach to regular expressions?

No, except for the previous answer

Do you agree that this approach could be better than using simple strings?

Sort of - I think instead of single characters to represent constructs, we could use more descriptive markup, but I doubt it would make complex logic any clearer.

How would you design the API?

Translate each construct into a method name, and allow nested function calls so that it's very easy to take a string and substitute method names into it.

I think the majority of the value will be in defining a robust library of utility functions, like matching "emails", "phone numbers", "lines that don't contain X", etc. that can be customized.

Would you use such a neat utility in your projects?

Maybe - but probably only for the longer ones, where it would be easier to debug function calls than debugging string editing, or where there is a nice utility function ready to use.

Do you think this would be fun implementing? ;)

Of course!

星星的轨迹 2024-08-15 14:26:54

回答问题的最后一部分(为了点赞)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

匹配符合 RFC 的电子邮件地址:D

In answer to the last part of the question (for Kudos)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

matches RFC compliant email addresses :D

淡墨 2024-08-15 14:26:54

正则表达式是有限状态机的描述。经典的文本表示不一定是坏事。它结构紧凑,相对明确,并且得到了很好的采用。

更好的表示可能是状态转换图,但这可能很难在源代码中使用。

一种可能性是从各种容器和组合器对象构建它。

大致如下(将其从伪代码转换为选择的语言留作热切者的练习):

domainlabel = oneormore(characterclass("a-zA-Z0-9-"))
separator = literal(".")
domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel)
localpart = oneormore(characterclassnot("@"))
emailaddress = sequence(localpart, literal("@"), domain)

请注意,上述内容将任意许多电子邮件地址错误地分类为有效,但不符合它们的语法需要遵循,因为该语法需要的不仅仅是简单的 FSM 来进行完整解析。不过,我不认为它会将有效地址错误分类为无效地址。

它应该对应于 [^@]+@([a-zA-Z0-9-]+.)+.([a-zA-Z0-9-]+)

A regular expression is a description of a finite state machine. The classic textual representation isn't necessarily bad. It is compact, it is relatively unambigous and it is fairly well adopted.

It MAY be that a better representation would be a state transition diagram, but that would probably be hard to use in source code.

One possibility would be to build it from a variety of container and combiner objects.

Something along the line of the following (turning this from pseudo-code to language of choice is left as an exercise for the eager):

domainlabel = oneormore(characterclass("a-zA-Z0-9-"))
separator = literal(".")
domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel)
localpart = oneormore(characterclassnot("@"))
emailaddress = sequence(localpart, literal("@"), domain)

Note that the above will incorrectly classify arbritarily many email addresses as being valid that do NOT conform to the grammar they're required to follow, as that grammar requires more than a simple FSM for full parsing. I don't believe it'd misclassify a valid address as invalid, though.

It should correspond to [^@]+@([a-zA-Z0-9-]+.)+.([a-zA-Z0-9-]+)

慢慢从新开始 2024-08-15 14:26:54

4.您会在您的项目中使用如此简洁的实用程序吗?

我很可能不会。我认为这是一个使用正确的工具来完成工作的例子。这里有一些很好的答案,例如: 1579202来自杰里米·斯坦的。我最近在最近的一个项目中“获得”了正则表达式,发现它们非常有用,而且如果您知道语法,经过正确的注释,它们是可以理解的。

我认为“了解语法”部分是人们对正则表达式失去兴趣的原因,对于那些不理解正则表达式的人来说,它们看起来很神秘,但它们是应用计算机科学中的强大工具(例如以编写软件为生)我觉得聪明的专业人士应该并且应该能够学会正确地使用它们和使用它们。

正如他们所说:“能力越大,责任越大。”我见过人们在任何地方都使用正则表达式来处理所有事情,但是如果有人花时间彻底学习了语法,那么他们就会明智地使用正则表达式,它们非常有帮助;对我来说,添加另一层会在某种程度上违背他们的目的,或者至少会剥夺他们的力量。

这只是我自己的观点,我可以理解那些想要这样的框架的人来自哪里,或者那些会避免正则表达式的人,但我很难从那些没有花时间的人那里听到“正则表达式很糟糕”了解它们并做出明智的决定。

4. Would you use such a neat utility in your projects?

I would most likely not. I think this is a case of using the right tool for the job. There are some great answers here such as: 1579202 from Jeremy Stein. I have recently "gotten" regular expressions on a recent project and found them to be very useful just as they are, and when properly commented they are understandable if you know the syntax.

I think the "knowing the syntax" part is what turns people off to Regular Expressions, to those who don't understand they look arcane and cryptic, but they are a powerful tool and in applied Computer Science (e.g. writing software for a living) I feel like intelligent professionals should and should be able to learn to use them and us them appropriately.

As they say "With great power comes great responsibility." I have seen people use regular expressions everywhere for everything, but used judiciously by someone who has taken the time to learn the syntax thoroughly, they are incredibly helpful; to me, adding another layer would in a way defeat their purpose, or at a minimum take away their power.

This just my own opinion and I can understand where people are coming from who would desire a framework like this, or who would avoid regular expressions, but I have hard time hearing "Regular Expressions are bad" from those who haven't take the time to learn them and make an informed decision.

萌化 2024-08-15 14:26:54

为了让每个人都满意(正则表达式大师和流体接口支持者),请确保流体接口可以输出适当的原始正则表达式模式,并且还使用工厂方法获取常规正则表达式并为其生成流体代码。

To make everybody happy (regex masters and fluid interface proponents), make sure the fluid interface can output an appropriate raw regex pattern, and also take a regular regex using a Factory method and generate fluid code for it.

夏日落 2024-08-15 14:26:54

What you are looking for can be found here:. It is a regular expression buillder which follows the Wizard Design Pattern

留蓝 2024-08-15 14:26:54

让我们比较一下:我经常使用 (N)Hibernate ICriteria 查询,它可以被视为到 SQL 的流畅映射。我曾经(现在仍然)对它们充满热情,但它们是否使 SQL 查询更清晰?不,恰恰相反,但另一个好处出现了:以编程方式构建语句、子类化它们并创建自己的抽象等变得更加容易。

我得到的是,如果完成的话,可以为给定语言使用新接口是的,可以证明是值得的,但不要太看重它。在许多情况下,它不会变得更容易阅读(嵌套减法字符类、后向捕获、if 分支等一些难以流畅组合的高级概念)。但在很多情况下,更大灵活性的好处超过了语法复杂性所增加的开销。

要添加到可能的替代方法列表中并使其脱离仅 Java 的上下文,请考虑 LINQ 语法。它可能是这样的(有点做作)(fromwhereselect 是 LINQ 中的关键字):

// for ^str(aa|bb){3}
from part in mystring
where part startswith "str"
and part hasgroups "aa" or "bb" as first    /* "aa" or "bb" in group 'first' */
and part repeats first 3                    /* repeat group 'first' 3 times */
select part + "extra"                       /* can contain complete statement block */

只是一个粗略的想法,我知道。 LINQ的好处是它由编译器检查,是一种语言中的语言。默认情况下,LINQ 也可以表示为流畅的链式语法,这使得它(如果设计良好)可以与其他 OO 语言兼容。

Let's compare: I have worked often with (N)Hibernate ICriteria queries, which can be considered a Fluent mapping to SQL. I was (and still am) enthusiastic about them, but did they make the SQL queries more legible? No, more to the contrary, but another benefit rose: it became much easier to programmatically build statements, to subclass them and create your own abstractions etc.

What I'm getting at is that using a new interface for a given language, if done right, can prove worthwhile, but don't think too highly of it. In many cases it won't become easier to read (nested subtraction character classes, Captures in look-behind, if-branching to name a few advanced concepts that will be hard to combine fluently). But in just as many cases, the benefits of greater flexibility outweigh the added overhead of syntax complexity.

To add to your list of possible alternative approaches and to take this out of the context of only Java, consider the LINQ syntax. Here's what it could look like (a bit contrived) (from, where and select are keywords in LINQ):

// for ^str(aa|bb){3}
from part in mystring
where part startswith "str"
and part hasgroups "aa" or "bb" as first    /* "aa" or "bb" in group 'first' */
and part repeats first 3                    /* repeat group 'first' 3 times */
select part + "extra"                       /* can contain complete statement block */

just a rough idea, I know. The good thing of LINQ is that it is checked by the compiler, a kind-of language in a language. By default, LINQ can also be expressed as fluent chained syntax, which makes it, if well designed, compatible with other OO languages.

二货你真萌 2024-08-15 14:26:54

我说去做吧,我确信实施起来很有趣。

我建议使用查询模型(类似于 jQuery、django ORM),其中每个函数返回一个查询对象,这样您就可以将它们链接在一起。

any("a").some("b").one("@").some(chars).one(".").some(chars) //a*b+@\w+\.\w+

其中 chars 是预定义的以适合任何字符。

可以通过使用选择来实现

any("a").choice("x", "z") // a(x|z)

每个函数的参数可以是字符串或另一个查询。例如,上面提到的 chars 变量可以定义为查询:

//this one is ascii only
chars = raw("a-zA-Z0-9")

因此,如果使用流畅的查询感觉很麻烦,您可以使用一个“原始”函数来接受正则表达式字符串作为输入系统。

实际上,这些函数只能返回原始正则表达式,链接它们只是连接这些原始正则表达式字符串。

any("a") ---> "a*"
raw("b+") ----> "b+"
one(".") ---> "\."
choice("a", "b") ----> (a|b)

I say go for it, I'm sure it's fun to implement.

I suggest using a query model (similar to jQuery, django ORM), where each function returns a query object, so you can chain them together.

any("a").some("b").one("@").some(chars).one(".").some(chars) //a*b+@\w+\.\w+

where chars is predefined to fit any character.

or can be achieved by using choices:

any("a").choice("x", "z") // a(x|z)

The argument to each function can be a string or another query. For example, the chars variable mentioned above can be defined as a query:

//this one is ascii only
chars = raw("a-zA-Z0-9")

And so, you can have a "raw" function that accepts a regex string as input if it feels to cumbersome to use the fluent query system.

Actually, these functions can just return raw regex, and chaining them is simply concatenating these raw regex strings.

any("a") ---> "a*"
raw("b+") ----> "b+"
one(".") ---> "\."
choice("a", "b") ----> (a|b)
如日中天 2024-08-15 14:26:54

我不确定用流畅的 API 替换 regexp 会带来什么好处。

请注意,我不是正则表达式向导(几乎每次需要创建正则表达式时我都必须重新阅读文档)。

流畅的 API 会使任何中等复杂度的正则表达式(假设大约 50 个字符)比所需的更复杂,并且最终不易于阅读,尽管它可能会改善 IDE 中正则表达式的创建 ,感谢代码完成。但代码维护通常比代码开发成本更高。

事实上,我什至不确定是否有可能有一个足够智能的 API 在创建新的正则表达式时真正为开发人员提供足够的指导,而不是谈论模棱两可的情况,正如之前的答案中提到的那样。

您提到了 RFC 的正则表达式示例。我 99% 确信(仍然有 1% 的希望;-))任何 API 都不会使该示例变得更简单,但相反,这只会使它更难以阅读!这是一个典型的例子,无论如何你都不想使用正则表达式!

即使对于正则表达式的创建,由于模式不明确的问题,使用流畅的 API 很可能第一次就无法得到正确的表达式,而必须更改多次,直到获得真正想要的内容。

毫无疑问,我确实喜欢流畅的界面;我开发了一些使用它们的库,并且我使用了几个基于它们的第三方库(例如用于 Java 测试的 FEST)。但我不认为它们可以成为解决任何问题的金锤子。

如果我们只考虑 Java,我认为正则表达式的主要问题是 Java 字符串常量中反斜杠的必要转义。这使得在 Java 中创建和理解正则表达式变得异常困难。因此,对我来说,增强 Java 正则表达式的第一步是语言更改,a la Groovy,其中字符串常量不需要转义反斜杠。

到目前为止,这是我改进 Java 中正则表达式的唯一建议。

I am not sure that replacing regexp with a fluent API would bring much.

Note that I am not a regexp wizard (I have to re-read the doc nearly every time I need to create a regexp).

A fluent API would make any medium-complexity regexp (let's say ~50 characters) even more complex than required and not easier to read in the end, although it may improve the creation of a regexp in an IDE, thanks to code completion. But code maintenance generally represents a higher cost than code development.

In fact, I am not even sure it would be possible to have an API smart enough to really provide enough guidance to the developer when creating a new regexp, not talking about ambiguous cases, as mentioned in a previous answer.

You mentioned a regexp example for an RFC. I am 99% sure (there is still 1% hope;-)) that any API would not make that example any simpler, but conversely that would only make it more complex to read! That's a typical example where you don't want to use regexp anyway!

Even regarding regexp creation, due to the problem of ambiguous patterns, it is probable that with a fluent API, you would never come with the right expression the first time, but would have to change several times until you get what you really want.

Make no mistake, I do love fluent interfaces; I have developed some libraries that use them, and I use several 3rd-party libraries based on them (e.g. FEST for Java testing). But I don't think they can be the golden hammer for any problem.

If we consider Java exclusively, I think the main problem with regexps is the necessary escaping of backslashes in Java string constants. That's one point that makes it incredibly difficult to create and understand regexp in Java. Hence, the first step to enhance Java regexp would, for me, be a language change, a la Groovy, where string constants don't need to escape backslashes.

So far that would be my only proposal to improve regexp in Java.

鲜肉鲜肉永远不皱 2024-08-15 14:26:54

尝试一下 RegexBee
这是一个流畅的正则表达式构建器,
这可能是一个很好的选择,因为它...

  • 流畅、
  • 不可变
  • 、分层、可组合、可重用,
  • 基于功能接口,
  • 高效
  • 缓存,
  • 不需要静态导入,只是为了紧凑
  • ,支持最常用的正则表达式功能,开箱即用,
  • 免费和开源

自述文件中的示例:

BeeFragment myFragment = Bee
        .then(Bee.BEGIN)
        .then(Bee.ASCII_WORD)
        .then(Bee.WHITESPACE.any())
        .then(Bee.ASCII_WORD.as("nameX"))
        .then(Bee.intBetween(-4, 1359)
                .or(Bee.fixed("-").more(Greediness.POSSESSIVE)))
        .then(Bee.ref("nameX"))
        .then(Bee.fixed("??)fixed?text. ").optional())
        .then(Bee.ASCII_WORD.optional(Greediness.LAZY))
        .then(Bee.END);

Pattern myPattern = myFragment.toPattern();
Matcher myMatcher = myPattern.matcher(someString);

if (myMatcher.matches()) {
    String nameX = myMatcher.group("nameX");
    System.out.println(String.format("We have a nice day, %s!", nameX));
}

Make a try with RegexBee.
It's a fluent regex builder,
which can be a good alternative, because it's...

  • fluent
  • immutable
  • hierarchical, composable, reusable
  • based on a functional interface
  • efficient
  • cached
  • doesn't need static imports just to be compact
  • supports most-used regex features out-of-the-box
  • free and open source

An example from its readme:

BeeFragment myFragment = Bee
        .then(Bee.BEGIN)
        .then(Bee.ASCII_WORD)
        .then(Bee.WHITESPACE.any())
        .then(Bee.ASCII_WORD.as("nameX"))
        .then(Bee.intBetween(-4, 1359)
                .or(Bee.fixed("-").more(Greediness.POSSESSIVE)))
        .then(Bee.ref("nameX"))
        .then(Bee.fixed("??)fixed?text. ").optional())
        .then(Bee.ASCII_WORD.optional(Greediness.LAZY))
        .then(Bee.END);

Pattern myPattern = myFragment.toPattern();
Matcher myMatcher = myPattern.matcher(someString);

if (myMatcher.matches()) {
    String nameX = myMatcher.group("nameX");
    System.out.println(String.format("We have a nice day, %s!", nameX));
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文