Regex to match between quotes, including escaped quotes

Posted 2024-07-16 10:00:44

This was originally a question I wanted to ask, but while researching the details for the question I found the solution and thought it may be of interest to others.

In Apache, the full request is in double quotes and any quotes inside are always escaped with a backslash:

1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"

I'm trying to construct a regex which matches all distinct fields. My current solution always stops on the first quote after the GET/POST (actually I only need all the values including the size transferred):

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"[^"]+"\s+(\d+)\s+(\d+|-)

I guess I'll also provide my solution from my PHP source with comments and better formatting:

$sPattern = ';^' .
    # ip address: 1
    '(\d+\.\d+\.\d+\.\d+)' .
    # ident and user id
    '\s+[^\s]+\s+[^\s]+\s+' .
    # 2 day/3 month/4 year:5 hh:6 mm:7 ss +timezone
    '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]' .
    # whitespace
    '\s+' .
    # request uri
    '"[^"]+"' .
    # whitespace
    '\s+' .
    # 8 status code
    '(\d+)' .
    # whitespace
    '\s+' .
    # 9 bytes sent
    '(\d+|-)' .
    # end of regex
    ';';

Using this with a simple case where the URL doesn't contain other quotes works fine:

1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
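
To see the limitation concretely, here is a minimal sketch in JavaScript (the language of the regex documentation cited under "Helpful resources"; the pattern text is the same as in the PHP source). The names `prefix`, `suffix`, `plainLine` and `escapedLine` are made up for this illustration.

```javascript
// Shared parts of the question's pattern: IP, ident, user id, timestamp ...
const prefix = String.raw`^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+`;
// ... and status code plus bytes sent.
const suffix = String.raw`\s+(\d+)\s+(\d+|-)`;

// Request field as in the first attempt: anything except a double quote.
const simple = new RegExp(prefix + String.raw`"[^"]+"` + suffix);

const plainLine = String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"`;
const escapedLine = String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"`;

const plainMatch = simple.exec(plainLine);
const escapedMatch = simple.exec(escapedLine);

console.log(plainMatch[8], plainMatch[9]); // 400 299
console.log(escapedMatch);                 // null
```

On the second line `[^"]+` stops at the first escaped quote, the status code cannot follow, and the whole match fails.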

Now I'm trying to get support for none, one or more occurrences of \" into it, but can't find a solution. Using regexpal.com I've come up with this so far:

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*"

Here's only the changed part:

    # request uri
    '"(.|\\(?="))*"' .

However, it's too greedy. It eats everything until the last ", when it should only eat until the first " not preceded by a \. I also tried introducing the requirement that there's no \ before the " I want, but it still eats to the end of the string (Note: I had to add extraneous \ characters to make this work in PHP):

    # request uri
    '"(.|\\(?="))*[^\\\\]"' .
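
The greedy behaviour of both attempts can be reproduced with a short JavaScript sketch that applies only the request-field fragments to the sample line. The variable names are illustrative; `[^\\]` is the regex that the PHP string `[^\\\\]` produces.

```javascript
const escapedLine = String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"`;

// First attempt: greedy group, closing quote.
const greedy = new RegExp(String.raw`"(.|\\(?="))*"`);
// Second attempt: additionally require a non-backslash before the closing quote.
const greedyNoBackslash = new RegExp(String.raw`"(.|\\(?="))*[^\\]"`);

// Everything from the first double quote to the end of the line.
const fromFirstQuote = escapedLine.slice(escapedLine.indexOf('"'));

const greedyText = greedy.exec(escapedLine)[0];
const guardedText = greedyNoBackslash.exec(escapedLine)[0];

console.log(greedyText === fromFirstQuote);  // true
console.log(guardedText === fromFirstQuote); // true
```

In both cases the `*` first consumes the rest of the line and only backtracks as far as the last double quote.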

But then it hit me: *?: If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times)

    # request uri
    '"(.|\\(?="))*?[^\\\\]"' .

The full regex:

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)
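
As a quick check, the full regex can be run against the sample line from the top of the question. This is a JavaScript sketch with illustrative variable names. One thing it makes visible: `(.|\\(?="))` is itself a capturing group, so here the status code and bytes sent end up in groups 9 and 10 rather than 8 and 9.

```javascript
const fullPattern = new RegExp(
  String.raw`^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+` +
  String.raw`"(.|\\(?="))*?[^\\]"` + // non-greedy request field
  String.raw`\s+(\d+)\s+(\d+|-)`
);

const escapedLine = String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"`;

const match = fullPattern.exec(escapedLine);

console.log(match[0]);            // the line up to and including "400 299"
console.log(match[9], match[10]); // 400 299
```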

Update 5th May 2009:

I discovered a small flaw in the regexp while parsing millions of lines: it breaks on lines which contain a backslash character right before the closing double quote. In other words:

...\\"

will break the regex. Apache will not log ...\" but will always escape the backslash to \\, so when there are two backslash characters before the double quote it's safe to assume that the quote ends the field.

Does anyone have an idea how to fix this within the regex?
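
The failure from the update can be reproduced with a small JavaScript sketch. The log line below is hypothetical (a request ending in an escaped backslash), constructed only to trigger the case; the variable names are illustrative as well.

```javascript
const nonGreedy = new RegExp(
  String.raw`^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+` +
  String.raw`"(.|\\(?="))*?[^\\]"` +
  String.raw`\s+(\d+)\s+(\d+|-)`
);

// Hypothetical line: the request ends with an escaped backslash (\\) right before the closing quote.
const trailingBackslashLine = String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"`;

const result = nonGreedy.exec(trailingBackslashLine);

console.log(result); // null
```

The closing quote is rejected by `[^\\]` because the character before it is a backslash, and no later quote on the line is followed by the status code and byte count.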

Helpful resources: the JavaScript Regexp documentation at developer.mozilla.org and regexpal.com

Comments (1)

浮世清欢 2024-07-23 10:00:44

Try this:

"(?:[^\\"]+|\\.)*"

This regular expression matches a double quote character followed by a sequence of either any character other than \ and " or an escaped sequence \α (where α can be any character) followed by the final double quote character. The (?:expr) syntax is just a non-capturing group.
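
A minimal JavaScript sketch of how this fragment might be dropped into the question's full pattern in place of the request-field part. The variable names and the third log line (a request ending in an escaped backslash) are assumptions made for illustration. Because the group is non-capturing, status code and bytes sent stay in groups 8 and 9.

```javascript
const prefix = String.raw`^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+`;
const suffix = String.raw`\s+(\d+)\s+(\d+|-)`;

// Quoted field: runs of ordinary characters, or a backslash followed by any one character.
const quotedField = String.raw`"(?:[^\\"]+|\\.)*"`;
const pattern = new RegExp(prefix + quotedField + suffix);

const lines = [
  String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"`,
  String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"`,
  // Hypothetical line with an escaped backslash right before the closing quote.
  String.raw`1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"`,
];

// Collect [status, bytes] for each line, or null when the line does not match.
const results = lines.map((line) => {
  const m = pattern.exec(line);
  return m ? [m[8], m[9]] : null;
});

console.log(results); // [ [ '400', '299' ], [ '400', '299' ], [ '400', '299' ] ]
```

Since every backslash consumes the character after it, `\"` never ends the field, while `\\"` does.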
