Regular expression for matching between quotes, including escaped quotes
This was originally a question I wanted to ask, but while researching the details for the question I found the solution and thought it may be of interest to others.
In Apache, the full request is in double quotes and any quotes inside are always escaped with a backslash:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
I'm trying to construct a regex which matches all distinct fields. My current solution always stops on the first quote after the GET/POST (actually I only need all the values including the size transferred):
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"[^"]+"\s+(\d+)\s+(\d+|-)
I guess I'll also provide my solution from my PHP source with comments and better formatting:
$sPattern = ';^' .
# ip address: 1
'(\d+\.\d+\.\d+\.\d+)' .
# ident and user id
'\s+[^\s]+\s+[^\s]+\s+' .
# 2 day/3 month/4 year:5 hh:6 mm:7 ss +timezone
'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]' .
# whitespace
'\s+' .
# request uri
'"[^"]+"' .
# whitespace
'\s+' .
# 8 status code
'(\d+)' .
# whitespace
'\s+' .
# 9 bytes sent
'(\d+|-)' .
# end of regex
';';
Using this with a simple case where the URL doesn't contain other quotes works fine:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
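To make the limitation concrete, here is a minimal sketch that runs the pattern above against both sample lines. It uses Python's `re` module purely for illustration (the question itself uses PHP); the names `NAIVE`, `SIMPLE` and `ESCAPED` are made up for this example.

```python
import re

# Pattern from the question; the request field is matched by "[^"]+",
# which cannot step over an escaped quote.
NAIVE = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    r'\s+"[^"]+"\s+(\d+)\s+(\d+|-)'
)

# The two sample lines from the question (raw strings keep the backslashes).
SIMPLE = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"'
ESCAPED = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'

simple_match = NAIVE.match(SIMPLE)
escaped_match = NAIVE.match(ESCAPED)

print(simple_match.groups())  # nine fields: ip, date parts, status, bytes
print(escaped_match)          # None: [^"]+ stops at the first escaped quote
```

The second line fails because the character class ends the request field at the first `"`, and the text that follows (` foo=bat...`) is not a status code.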
Now I'm trying to add support for zero, one or more occurrences of \" to it, but can't find a solution. Using regexpal.com I've come up with this so far:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*"
Here's only the changed part:
# request uri
'"(.|\\(?="))*"' .
However, it's too greedy. It eats everything until the last ", when it should only eat until the first " not preceded by a \. I also tried introducing the requirement that there's no \ before the " I want, but it still eats to the end of the string (Note: I had to add extraneous \ characters to make this work in PHP):
# request uri
'"(.|\\(?="))*[^\\\\]"' .
But then it hit me: *?
?: If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times)
# request uri
'"(.|\\(?="))*?[^\\\\]"' .
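The difference between the greedy and the non-greedy request-field pattern can be checked in isolation. This is a small sketch in Python (used only for illustration; in a PHP single-quoted string the backslashes need the extra escaping noted above), with hypothetical names `GREEDY`, `LAZY` and `LINE`.

```python
import re

LINE = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'

# Greedy variant: the repeated group runs to the last quote on the line.
GREEDY = re.compile(r'"(.|\\(?="))*"')
# Non-greedy variant: stops at the first quote not preceded by a backslash.
LAZY = re.compile(r'"(.|\\(?="))*?[^\\]"')

greedy_field = GREEDY.search(LINE).group(0)
lazy_field = LAZY.search(LINE).group(0)

print(greedy_field)  # everything from the first quote to the end of the line
print(lazy_field)    # only the request field, escaped quotes included
```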
The full regex:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)
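As a quick check, the full regex can be run against the sample line with the escaped quotes. The sketch below uses Python for illustration; `FULL` and `LINE` are names invented here. One thing it makes visible: the group `(.|\\(?="))` is a capturing group, so in this version the status code and byte count end up in groups 9 and 10 rather than 8 and 9 as in the commented PHP pattern.

```python
import re

# The full regex from the question, split across lines for readability.
FULL = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    r'\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)'
)

LINE = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'

match = FULL.match(LINE)

ip = match.group(1)
status = match.group(9)       # shifted by one: group 8 is the repeated group
bytes_sent = match.group(10)

print(ip, status, bytes_sent)
```

Writing the repeated group as `(?:...)` would keep the numbering aligned with the comments in the PHP source.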
Update 5th May 2009:
I discovered a small flaw in the regexp while parsing millions of lines: it breaks on lines which contain a backslash character right before the closing double quote. In other words:
...\\"
will break the regex. Apache will not log ...\" but will always escape the backslash to \\, so it's safe to assume that when there are two backslash characters before the double quote, the quote itself is not escaped and ends the field.
Does anyone have an idea how to fix this within the regex?
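The flaw described in the update can be reproduced with a made-up log line whose request ends in an escaped backslash. This is only a sketch in Python; the `FLAWED` line is a hypothetical example constructed for this purpose, not taken from a real log.

```python
import re

FULL = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    r'\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)'
)

# Sample line from the question: escaped quotes inside the request.
NORMAL = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'
# Hypothetical line: the request ends with an escaped backslash (\\) right
# before the closing quote.
FLAWED = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"'

normal_match = FULL.match(NORMAL)
flawed_match = FULL.match(FLAWED)

print(normal_match is not None)  # the sample line still parses
print(flawed_match)              # None: [^\\]" rejects the real closing quote
```

The pattern insists that the character before the closing quote is not a backslash, so it cannot tell an escaped quote (`\"`) from an escaped backslash followed by the real closing quote (`\\"`).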
Helpful resources: the JavaScript Regexp documentation at developer.mozilla.org and regexpal.com
Try this:
This regular expression matches a double quote character, followed by a sequence of either any character other than \ and " or an escape sequence \α (where α can be any character), followed by the final double quote character. The (?:expr) syntax is just a non-capturing group.
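The pattern itself is not shown here, so the following is only a sketch of one pattern that fits that description, written in Python for illustration. The expression `"((?:[^\\"]|\\.)*)"`, the added capturing group around the field contents, and the names `REQUEST`, `ESCAPED` and `DOUBLE_BACKSLASH` are assumptions made for this example, and the second line is a made-up test case.

```python
import re

# A quote, then any run of "ordinary character" or "backslash plus any
# character", then the closing quote. The outer group captures the contents.
REQUEST = re.compile(r'"((?:[^\\"]|\\.)*)"')

ESCAPED = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'
DOUBLE_BACKSLASH = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"'

# search() returns the first quoted field on the line, i.e. the request.
escaped_request = REQUEST.search(ESCAPED).group(1)
double_request = REQUEST.search(DOUBLE_BACKSLASH).group(1)

print(escaped_request)  # GET /\" foo=bat\" HTTP/1.0
print(double_request)   # GET /foo\\
```

Because every backslash is consumed together with the character after it, `\\` is read as one unit and the quote that follows is correctly treated as the end of the field, which addresses the case from the 2009 update.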