Why can't regular expressions use keywords instead of characters?

Posted 2024-07-14 17:40:32

Okay, I barely understand RegEx basics, but why couldn't they design it to use keywords (like SQL) instead of some cryptic wildcard characters and symbols?

Is it for performance since the RegEx is interpreted/parsed at runtime? (not compiled)

Or maybe for speed of writing? Once you have learned a few "simple" character combinations, is it easier to type one character than a keyword?

Comments (14)

我也只是我 2024-07-21 17:40:32

You really want this?

Pattern findGamesPattern = Pattern.With.Literal(@"<div")
    .WhiteSpace.Repeat.ZeroOrMore
    .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
    .NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal(@"-game""")
    .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
    .Literal(@"<!--gameStatus")
    .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
    .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal("-->");

Ok, but it's your funeral, man.

Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx

菩提树下叶撕阳。 2024-07-21 17:40:32

Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example

  • every character is a regular expression, representing itself
  • if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
  • ...
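The inductive rules above can be sketched in Python as a tiny data type plus a matcher. This is only an illustration of the definition, not how real regex engines are built; the class names (`Char`, `Opt`, `Alt`, `Cat`) and the helpers `ends` and `fullmatch` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """A single character, representing itself."""
    c: str


@dataclass(frozen=True)
class Opt:
    """a? : zero or one occurrence of a."""
    a: object


@dataclass(frozen=True)
class Alt:
    """a|b : either a or b."""
    a: object
    b: object


@dataclass(frozen=True)
class Cat:
    """ab : a followed by b."""
    a: object
    b: object


def ends(r, s, i):
    """Yield every index at which a match of r starting at s[i] can end."""
    if isinstance(r, Char):
        if i < len(s) and s[i] == r.c:
            yield i + 1
    elif isinstance(r, Opt):
        yield i                      # matched nothing
        yield from ends(r.a, s, i)   # matched a once
    elif isinstance(r, Alt):
        yield from ends(r.a, s, i)
        yield from ends(r.b, s, i)
    elif isinstance(r, Cat):
        for j in ends(r.a, s, i):
            yield from ends(r.b, s, j)


def fullmatch(r, s):
    """True if r matches the whole string s."""
    return any(j == len(s) for j in ends(r, s, 0))


# The expression ab?|c built from the rules, one constructor per rule.
expr = Alt(Cat(Char("a"), Opt(Char("b"))), Char("c"))
```

Each rule of the definition becomes one constructor, which is why the symbolic notation (`ab?|c`) is so compact by comparison.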

Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as the search pattern:

grep -R 'main' *.c

Or maybe very simple patterns:

grep -c ':-[)(]' seidl.txt

Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.

扬花落满肩 2024-07-21 17:40:32

Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form:
100 E Main St Springfield MA 01234

Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):

 m/
     ([1-9]\d*)\s+
     ((?:N|S|E|W)\s+)?
     (\w+(?:\s+\w+)*)\s+
     (ave|ln|st|rd)\s+
     ([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
     ([A-Z]{2})\s+
     (\d{5}(?:-\d{4})?)
  /ix;

This Perl 6 regex has the same behavior:

grammar USMailAddress {
     rule  TOP { <addr> <city> <state> <zip> }

     rule  addr { <[1..9]>\d* <direction>?
                  <streetname> <streettype> }
     token direction { N | S | E | W }
     token streetname { \w+ [ \s+ \w+ ]* }
     token streettype {:i ave | ln | rd | st }
     token city { <alpha> [ \s+ <alpha> ]* }
     token state { <[A..Z]>**{2} }
     token zip { \d**{5} [ - \d**{4} ]? }
  }

A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:

if $addr ~~ m/^<USMailAddress::TOP>$/ {
     say "$<city>, $<state>";
}

This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.
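For comparison, mainstream regex flavors take a smaller step in the same direction with named groups and free-spacing mode. Here is a rough Python sketch of the same address pattern, with the same simplifications as the Perl 5 version above. The group names are chosen here to mirror the grammar's token names and are not part of the original example.

```python
import re

# Named groups play the role of the grammar's tokens; re.VERBOSE lets the
# pattern be laid out one piece per line.
ADDRESS = re.compile(
    r"""
    (?P<number>[1-9]\d*)\s+
    (?:(?P<direction>[NSEW])\s+)?
    (?P<streetname>\w+(?:\s+\w+)*?)\s+
    (?P<streettype>ave|ln|st|rd)\s+
    (?P<city>[A-Za-z]+(?:\s+[A-Za-z]+)*)\s+
    (?P<state>[A-Z]{2})\s+
    (?P<zip>\d{5}(?:-\d{4})?)
    """,
    re.VERBOSE | re.IGNORECASE,
)

m = ADDRESS.fullmatch("100 E Main St Springfield MA 01234")
if m:
    print(f"{m['city']}, {m['state']}")  # Springfield, MA
```

Unlike a grammar, the pieces here cannot be reused or called by name from other patterns; the names only label the captures.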

萤火眠眠 2024-07-21 17:40:32

Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?

Source text
Company: A Dept.: B

Standard regex:

Company:\s+(.+)\s+Dept\.:\s+(.+)

Or even:

Company: (.+) Dept\.: (.+)

Keyword regex (trying really hard not to build a strawman...)

"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)

Or simplified:

"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)

No, it's probably not better.

池木 2024-07-21 17:40:32

Because it corresponds to formal language theory and its mathematical notation.

我不咬妳我踢妳 2024-07-21 17:40:32

It's Perl's fault...!

Actually, more specifically, regular expressions come from early Unix development, when concise syntax was valued much more highly. Storage, processing time, physical terminals, etc. were all very limited, rather unlike today.

The history of Regular Expressions on Wikipedia explains more.

There are alternatives to Regex, but I'm not sure any have really caught on.

EDIT: Corrected by John Saunders: Regular Expressions were popularised by Unix, but first implemented by the QED editor. The same design constraints applied, even more so, to earlier systems.

迷迭香的记忆 2024-07-21 17:40:32

Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll see that

In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions.

蹲在坟头点根烟 2024-07-21 17:40:32

This goes back much further than Perl. The Wikipedia entry on regular expressions attributes the first implementations to Ken Thompson of UNIX fame, who implemented them in the QED editor and then in ed. My guess is that the syntax was kept short for performance reasons, long before anything ran client-side. Mastering Regular Expressions is a great book on the subject; it covers annotating a regular expression (with the /x flag) to make it easier to read and understand.
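As a small illustration of that annotation style: Python's `re` module spells the /x flag as `re.VERBOSE`. The sketch below writes one ZIP-code pattern twice, once compact and once commented. The pattern and variable names are chosen only for this example.

```python
import re

# Compact form: everything on one line.
ZIP_COMPACT = re.compile(r"\d{5}(?:-\d{4})?")

# Verbose form: whitespace is ignored and '#' starts a comment.
ZIP_VERBOSE = re.compile(
    r"""
    \d{5}          # five-digit ZIP code
    (?:            # optional ZIP+4 extension:
        -          #   a literal hyphen
        \d{4}      #   followed by four digits
    )?
    """,
    re.VERBOSE,
)

# Both forms accept and reject the same inputs.
for text in ["01234", "01234-5678", "1234"]:
    assert bool(ZIP_COMPACT.fullmatch(text)) == bool(ZIP_VERBOSE.fullmatch(text))
```

The symbols stay the same; only the layout and the comments change, which is often enough to make a long pattern maintainable.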

你在看孤独的风景 2024-07-21 17:40:32

Because the idea of regular expressions--like many things that originate from UNIX--is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax it wouldn't be a regex, it'd be a program.

〆一缕阳光ご 2024-07-21 17:40:32

It's actually pretty easy to implement a "wordier" form of regex -- please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).
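A minimal sketch of that idea in Python, using the "Company: A Dept.: B" example from an earlier answer. The helper names (`literal`, `one_or_more`, `group`, `sequence`) are made up here and are not taken from the linked answer or from any library.

```python
import re


def literal(text):
    """Match text exactly, escaping any regex metacharacters."""
    return re.escape(text)


def one_or_more(pattern):
    """Repeat pattern one or more times."""
    return f"(?:{pattern})+"


def group(pattern):
    """Wrap pattern in a capturing group."""
    return f"({pattern})"


def sequence(*parts):
    """Concatenate sub-patterns in order."""
    return "".join(parts)


WHITESPACE = r"\s"
ANY = "."

pattern = sequence(
    literal("Company:"), one_or_more(WHITESPACE), group(one_or_more(ANY)),
    one_or_more(WHITESPACE),
    literal("Dept.:"), one_or_more(WHITESPACE), group(one_or_more(ANY)),
)

match = re.fullmatch(pattern, "Company: A Dept.: B")
print(match.groups())  # ('A', 'B')
```

Because every helper just returns a string, the result is still an ordinary regex and can be mixed freely with hand-written patterns.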

萌化 2024-07-21 17:40:32

I don't think keywords would give any benefit. Regular expressions as such are complex but also very powerful.

What I think is more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements and many more examples).
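To make the replacement-syntax point concrete, here is how Python's `re.sub` handles it, as one example of a library with its own conventions. Backreferences in the replacement are written `\1` or `\g<name>`, while `$1` is not special and is inserted literally.

```python
import re

text = "hello world"

# Numbered backreferences use a backslash.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", text)

# Named groups are referenced as \g<name>.
swapped_named = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)", r"\g<second> \g<first>", text
)

# The $1 style used by some other flavors is just literal text here.
dollar_style = re.sub(r"(\w+) (\w+)", "$2 $1", text)

print(swapped)        # world hello
print(swapped_named)  # world hello
print(dollar_style)   # $2 $1
```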

他是夢罘是命 2024-07-21 17:40:32

I know it's answering your question the wrong way around, but RegExBuddy has a feature that explains your regular expression in plain English. This might make it a bit easier to learn.

凉城凉梦凉人心 2024-07-21 17:40:32

If the language you are using supports POSIX regexes, you can use them.

An example:

\d

would be the same as

[[:digit:]]

The bracket notation is much clearer about what it matches. I would still learn the "cryptic wildcard characters and symbols", since you will see them in other people's code and need to understand them.

There are more examples in the table on regular-expressions.info's page.

陌上芳菲 2024-07-21 17:40:32

For some reason, my previous answer got deleted. Anyway, I think ruby regexp machine would fit the bill, at http://www.rubyregexp.sf.net. It is my own project, but I think it should work.
