Tokenizing complex strings (Snort rules) with regular expressions (.NET)
I need help from the regex wizards out there. I am trying to write a simple parser that can tokenize the options list of a Snort rule (Snort being the IDS/IPS software). The problem is that I can't find a workable formula that breaks apart the individual rule options at their terminating semicolons. The formulas I have cooked up grab all the options between the parentheses into a single capture group.
I am using the excellent RegExr tool at the GSkinner site with some of the below sample rule options from Emerging Threats (I parsed off the rule header -- that's easy to tokenize):
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot|0d 0a|"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2003409; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2003409; rev:4;)
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent Inbound (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2007762; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2007762; rev:4;)
(msg:"ET POLICY McAfee Update User Agent (McAfee AutoUpdate)"; flow:to_server,established; content:"User-Agent|3a| "; http_header; nocase; content:"McAfee AutoUpdate"; http_header; pcre:"/User-Agent\x3a[^\n]+McAfee AutoUpdate/i"; classtype:not-suspicious; reference:url,doc.emergingthreats.net/2003381; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_McAffee; sid:2003381; rev:6;)
(msg:"ET DELETED Metacafe.com family filter off"; flow:established,to_server; content:"POST"; http_method; content:"Host|3a| www.metacafe.com"; http_header; fast_pattern:6,16; content:"submit=Continue+-+I%27m+over+18"; classtype:policy-violation; reference:url,doc.emergingthreats.net/2006367; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Metacafe; sid:2006367; rev:7;)
And this is the formula:
([a-zA-Z0-9_:]+(?:[\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*])+;)
The problem is, it doesn't handle colons. So two of the rules above will not have their 'content' options properly parsed. But on RegExr, each option will be highlighted in blue, including the terminating semi-colon, but NOT the space after the semi-colon. If I fed this into .NET, I should be able to do a Regex.Split and break apart all the tokens correctly.
If I add the colon to the character list, then on RegExr, the entire set of rules will get tokenized as a single blob of text, which is not what I want. Further attempts to tweak the formula result in Adobe Flash crashing, indicating I'm hitting a bug in either Flash or RegExr.
I've not ruled out writing my own string tokenizer, but I was hoping regex could save me from dealing with things like counting my open quotations, escaped characters, whitespace, etc.
Snort rule options typically come in the following format:
option:value;
option:"string value";
option:!"negated string value";
option:>num;
option:param1,param2,param3;
But several options tend to have more 'exotic' formats for their value, like byte_test. And then there is everyone's favourite, 'pcre', which is basically an option for matching with Perl-compatible regexes. So any such tokenizer has to avoid getting confused if it runs into the 'pcre' keyword with a regex in it.
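As a rough illustration of the common formats listed above, here is a minimal sketch (in Python, for brevity) that splits one already-isolated option into its keyword and raw value at the first colon. The helper name `split_option` is made up for this example, and it assumes the option has already been separated from its neighbours, which is the hard part the question is actually about.

```python
from typing import Optional, Tuple


def split_option(token: str) -> Tuple[str, Optional[str]]:
    """Split one 'keyword:value;' token into (keyword, value).

    Hypothetical helper: assumes `token` is a single, already-isolated option.
    """
    body = token.strip()
    if body.endswith(";"):
        body = body[:-1]  # drop the terminating semicolon
    # Only the first colon separates keyword from value; later colons
    # (for example inside a quoted content string) belong to the value.
    keyword, sep, value = body.partition(":")
    return keyword.strip(), (value.strip() if sep else None)


print(split_option('fast_pattern:6,16;'))
print(split_option('nocase;'))
print(split_option(r'content:"|0d 0a|User-Agent\: MJ12bot";'))
```

Options without a value, such as nocase, come back with `None` as the value, so the caller can tell "no value" apart from an empty one.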
Thoughts?
Edit:
This below is REALLY close:
([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)
But, according to RegExr, it gets messed by pcre syntax:
(msg:"ET WEB_SPECIFIC_APPS Horde 3.0.9-3.1.0 Help Viewer Remote PHP Exploit"; flow:established,to_server; content:"/services/help/"; nocase; http_uri; pcre:"/module=[^\;]*\;.*\"/UGi"; classtype:web-application-attack; reference:url,www.milw0rm.com/exploits/1660; reference:cve,2006-1491; reference:bugtraq,17292; reference:url,doc.emergingthreats.net/2002867; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/WEB_SPECIFIC_APPS/WEB_Horde; sid:2002867; rev:9; http_method;)
In the above, every single option is highlighted as a distinct grouping, except ]*\;.*\"/ . I would think that \x00-\xff would get it all, but it appears that I am using a lazy match. A greedy match gets everything, including all the spaces between options, which I do not want. So I need to somehow modify the regex to handle tokenizing pcre text.
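The behaviour described here can be reproduced outside RegExr. The following is a small sketch in Python rather than .NET (the pattern is copied unchanged and uses no Python-specific syntax); the variable names are illustrative only. The lazy quantifier stops at the first semicolon it sees, which is the escaped one inside the pcre value.

```python
import re

# The "really close" pattern from the edit above, unchanged.
LAZY = re.compile(r'([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)')

# Just the pcre option from the Horde rule.
option = r'pcre:"/module=[^\;]*\;.*\"/UGi";'

tokens = LAZY.findall(option)
for token in tokens:
    print(token)
# prints:
# pcre:"/module=[^\;
# UGi";
```

The text between the two tokens, ]*\;.*\"/ , is never matched because it contains no word character from which a new match could start, which agrees with what RegExr leaves unhighlighted.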
Edit2: This does the trick:
([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)
I had to play with a few example regexes that work with quoted strings. I finally realized that what I was staring at were negative lookbehinds that skip escaped quotes. This seems to handle any other escaped character too, because escaped characters only appear inside unescaped quotes.
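For anyone who wants to check this outside RegExr, here is a short sketch in Python (chosen for brevity; the lookbehind syntax is the same in .NET). The sample string is a shortened version of the Horde rule's options, and collecting the matches, rather than splitting on them, is an assumption about how the pattern would be used.

```python
import re

# The Edit2 pattern, unchanged.
EDIT2 = re.compile(r'([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)')

# Shortened options list taken from the Horde rule above.
options = (r'content:"/services/help/"; nocase; http_uri; '
           r'pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867; rev:9;')

tokens = EDIT2.findall(options)
for token in tokens:
    print(token)
```

On this input the pcre option comes out as one token, because a semicolon preceded by a backslash fails the lookbehind and the lazy match keeps going.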
Comments (1)
No need for lookaround. Just carefully write the regex to precisely match what you need. This is made much clearer (and easier to maintain) by writing this in verbose free-spacing mode like so: (Although VB.NET syntax makes it awkward to do so)
This regex demonstrates use of Jeffrey Friedl's "Unrolling the Loop" efficiency technique for correctly matching quoted strings which may contain escaped characters. (See: MRE3, i.e. Mastering Regular Expressions, 3rd Edition.)
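As an illustration of the idea only, and not the answerer's original pattern, here is a minimal sketch of a lookaround-free option tokenizer written in free-spacing mode. It is in Python for brevity: `re.VERBOSE` plays the role of .NET's `RegexOptions.IgnorePatternWhitespace`, and the named groups use Python's `(?P<name>...)` spelling where .NET would use `(?<name>...)`. The group names and the `tokenize` helper are assumptions made for this example.

```python
import re

OPTION = re.compile(r'''
    (?P<keyword> \w+ )              # option keyword, e.g. msg, pcre, nocase
    (?:
        \s* : \s*                   # separator between keyword and value
        (?P<value>
            (?:
                [^";\\]             # ordinary character outside quotes
              | \\.                 # escaped character outside quotes
              | "                   # opening quote
                [^"\\]*             #   normal characters
                (?: \\. [^"\\]* )*  #   unrolled loop: (escape, normal*)*
                "                   # closing quote
            )*
        )
    )?                              # some options (nocase) have no value
    \s* ;                           # terminating semicolon
''', re.VERBOSE | re.DOTALL)


def tokenize(options: str):
    """Return (keyword, value) pairs; value is None for bare keywords."""
    return [(m.group("keyword"), m.group("value"))
            for m in OPTION.finditer(options)]


sample = (r'(content:"/services/help/"; nocase; '
          r'pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867; rev:9;)')
for keyword, value in tokenize(sample):
    print(keyword, value)
```

Because a quoted string is consumed as a single unit, with each backslash swallowing the character after it, semicolons and quotes inside the pcre value never end the option early, and no lookbehind is needed.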
Oh yeah, one more thing... Icarus has found you!