regex language-design

Why can't regular expressions use keywords instead of characters?

Posted 2024-07-14 17:40:32

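OK, I barely understand RegEx basics, but why couldn't they design it to use keywords (like SQL) instead of cryptic wildcard characters and symbols?

Is it for performance, since the regex is interpreted/parsed at runtime (not compiled)?

Or maybe for speed of writing? Once you have learned a few "simple" character combinations, is it easier to type 1 character than a keyword?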

我也只是我 2024-07-21 17:40:32

You really want this?

Pattern findGamesPattern = Pattern.With.Literal(@"<div")
    .WhiteSpace.Repeat.ZeroOrMore
    .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
    .NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal(@"-game""")
    .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
    .Literal(@"<!--gameStatus")
    .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
    .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal("-->");

Ok, but it's your funeral, man.

Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx


菩提树下叶撕阳。 2024-07-21 17:40:32

Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example

every character is a regular expression, representing itself
if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
...

Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as a search pattern:

grep -R 'main' *.c

Or maybe very simple patterns:

grep -c ':-[)(]' seidl.txt

Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.
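As a rough illustration of the rules above, here is a minimal Python sketch. Python and its re module are my choice here, not part of the original answer, and count_smiley_lines and the sample lines are made up for the example. It checks the three constructions (a?, a|b, ab) and then counts lines containing the smiley pattern, much like the grep -c call.

```python
import re

# Every character is a regular expression that matches itself.
assert re.fullmatch(r"a", "a")

# If a and b are regular expressions, so are a?, a|b and ab.
assert re.fullmatch(r"a?", "")     # optional: zero occurrences
assert re.fullmatch(r"a?", "a")    # optional: one occurrence
assert re.fullmatch(r"a|b", "b")   # alternation
assert re.fullmatch(r"ab", "ab")   # concatenation

# The smiley pattern from the grep example: ':-' followed by ')' or '('.
smiley = re.compile(r":-[)(]")

def count_smiley_lines(lines):
    """Count the lines that contain a smiley, similar to grep -c."""
    return sum(1 for line in lines if smiley.search(line))

# Hypothetical sample input standing in for seidl.txt.
sample = ["hello :-)", "no smiley here", "sad :-( and happy :-)"]
print(count_smiley_lines(sample))
```

Like grep -c, the function counts matching lines, not individual matches, so a line with two smileys counts once.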


扬花落满肩 2024-07-21 17:40:32

Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form:
100 E Main St Springfield MA 01234

Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):

 m/
     ([1-9]\d*)\s+
     ((?:N|S|E|W)\s+)?
     (\w+(?:\s+\w+)*)\s+
     (ave|ln|st|rd)\s+
     ([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
     ([A-Z]{2})\s+
     (\d{5}(?:-\d{4})?)
  /ix;

This Perl 6 regex has the same behavior:

grammar USMailAddress {
     rule  TOP { <addr> <city> <state> <zip> }

     rule  addr { <[1..9]>\d* <direction>?
                  <streetname> <streettype> }
     token direction { N | S | E | W }
     token streetname { \w+ [ \s+ \w+ ]* }
     token streettype {:i ave | ln | rd | st }
     token city { <alpha> [ \s+ <alpha> ]* }
     token state { <[A..Z]>**{2} }
     token zip { \d**{5} [ - \d**{4} ]? }
  }

A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:

if $addr ~~ m/^<USMailAddress::TOP>$/ {
     say "$<city>, $<state>";
}

This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.
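For readers without a Perl 6 installation, the idea of building a pattern from named pieces can be approximated in other regex flavors. The following is only a sketch in Python, which is my choice and not the author's. The PARTS table, named and parse_address are hypothetical names, and the sub-patterns are loose ASCII-only stand-ins for the tokens in the grammar, not an exact port.

```python
import re

# Hypothetical sub-patterns, named after the tokens in the grammar above.
PARTS = {
    "number": r"[1-9]\d*",
    "direction": r"[NSEW]",
    "streetname": r"\w+(?:\s+\w+)*",
    "streettype": r"ave|ln|rd|st",
    "city": r"[A-Za-z]+(?:\s+[A-Za-z]+)*",
    "state": r"[A-Z]{2}",
    "zip": r"\d{5}(?:-\d{4})?",
}

def named(name):
    """Wrap a sub-pattern in a named capture group."""
    return f"(?P<{name}>{PARTS[name]})"

# Assemble the full pattern from the named pieces, in address order.
ADDRESS = re.compile(
    named("number") + r"\s+"
    + "(?:" + named("direction") + r"\s+)?"
    + named("streetname") + r"\s+"
    + named("streettype") + r"\s+"
    + named("city") + r"\s+"
    + named("state") + r"\s+"
    + named("zip"),
    re.IGNORECASE,
)

def parse_address(text):
    """Return the named parts as a dict, or None if the text does not match."""
    match = ADDRESS.fullmatch(text)
    return match.groupdict() if match else None

parts = parse_address("100 E Main St Springfield MA 01234")
print(parts["city"], parts["state"])
```

Unlike a real grammar, this still compiles down to one flat regex, so the pieces cannot recurse and cannot be overridden as methods.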


萤火眠眠 2024-07-21 17:40:32

Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?

Source text
Company: A Dept.: B

Standard regex:

Company:\s+(.+)\s+Dept.:\s+(.+)

Or even:

Company: (.+) Dept.: (.+)

Keyword regex (trying really hard not to get a strawman...)

"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)

Or simplified:

"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)

No, it's probably not better.


池木 2024-07-21 17:40:32

Because it corresponds to formal language theory and its mathematical notation.


我不咬妳我踢妳 2024-07-21 17:40:32

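It's Perl's fault...!

Actually, more specifically, regular expressions come from early Unix development, when concise syntax was valued much more highly. Storage, processing time, physical terminals and so on were all very limited, unlike today.

The history of regular expressions on Wikipedia explains more.

There are alternatives to regex, but I'm not sure any of them has really caught on.

EDIT: Corrected by John Saunders: regular expressions were popularised by Unix, but were first implemented in the QED editor. The same design constraints applied to earlier systems, even more so.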


迷迭香的记忆 2024-07-21 17:40:32

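Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll find:

In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions.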


蹲在坟头点根烟 2024-07-21 17:40:32

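This is much older than PERL. The Wikipedia entry on regular expressions attributes the first implementations to Ken Thompson of UNIX fame, who implemented them in QED and then in the ed editor. I guess the commands had short names for performance reasons, long before anything ran client-side. Mastering Regular Expressions is a great book on the subject, and it covers the option of annotating a regular expression (with the /x flag) to make it easier to read and understand.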
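The /x flag mentioned above is a Perl feature. As a hedged illustration in Python (my choice of language), re.VERBOSE plays a similar role: whitespace in the pattern is ignored and # starts a comment. The ZIP code pattern below is only an example and the name zip_code is made up.

```python
import re

# re.VERBOSE is Python's counterpart to Perl's /x flag.
zip_code = re.compile(
    r"""
    (\d{5})             # five-digit ZIP code
    (?: - (\d{4}) )?    # optional "-1234" extension
    """,
    re.VERBOSE,
)

with_extension = zip_code.fullmatch("01234-5678")
without_extension = zip_code.fullmatch("01234")
print(with_extension.groups())
print(without_extension.groups())
```

The compiled pattern behaves the same as the one-line form (\d{5})(?:-(\d{4}))?, so the layout and comments are purely for the reader.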


你在看孤独的风景 2024-07-21 17:40:32

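Because the idea of regular expressions, like many things that originate from UNIX, is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax, it wouldn't be a regex, it would be a program.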


〆一缕阳光ご 2024-07-21 17:40:32

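It's actually pretty easy to implement a "wordier" form of regex - please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).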
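The linked answer is not reproduced on this page, so the following is only a guess at the general approach, sketched in Python. Every helper name here (literal, one_or_more, group, seq and the constants) is hypothetical. Each function simply returns a regex string, and the strings are concatenated into an ordinary pattern.

```python
import re

# Hypothetical building blocks: each one is, or returns, a regex string.
WHITESPACE = r"\s"
ANYTHING = "."

def literal(text):
    """Match the given text exactly, escaping any special characters."""
    return re.escape(text)

def one_or_more(pattern):
    """Repeat the pattern one or more times."""
    return f"(?:{pattern})+"

def group(name, pattern):
    """Capture the pattern under the given name."""
    return f"(?P<{name}>{pattern})"

def seq(*patterns):
    """Match the given patterns one after another."""
    return "".join(patterns)

pattern = seq(
    literal("Company:"), one_or_more(WHITESPACE),
    group("company", one_or_more(ANYTHING)),
    one_or_more(WHITESPACE), literal("Dept.:"), one_or_more(WHITESPACE),
    group("dept", one_or_more(ANYTHING)),
)

match = re.search(pattern, "Company: A Dept.: B")
print(match.group("company"), match.group("dept"))
```

The result is still a plain regex string, so it can be printed, logged, or passed to any function that accepts a pattern.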


萌化 2024-07-21 17:40:32

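I don't think keywords would bring any benefit. Regular expressions are complex in themselves, but also very powerful.

What I find more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements, and there are many more examples).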


他是夢罘是命 2024-07-21 17:40:32

I know this answers your question the wrong way around, but RegExBuddy has a feature that explains your regex in plain English. That might make learning a bit easier.


凉城凉梦凉人心 2024-07-21 17:40:32

If the language you are using supports POSIX regexes, you can use their named character classes.

An example:

\d

would be the same as

[[:digit:]]

The bracket notation is much clearer about what it is matching. I would still learn the "cryptic wildcard characters and symbols", since you will still see them in other people's code and need to understand them.

There are more examples in the table on regular-expressions.info's page.
