Parsing Snort Logs with PyParsing
I'm having a problem parsing Snort logs with the pyparsing module.
The problem is separating the Snort log (which has multiline entries, separated by a blank line) and getting pyparsing to parse each entry as a whole chunk, rather than reading it in line by line and expecting the grammar to work on each line (which, obviously, it does not).
I have tried converting each chunk to a temporary string and stripping out the newlines inside each chunk, but it refuses to process correctly. I may be wholly on the wrong track, but I don't think so (a similar form works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to your basic file iterator / line processing).
Here's a sample of the log and the code I have so far:
[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**]
[Classification: Misc activity] [Priority: 3]
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88
Type:3 Code:10 DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED
** ORIGINAL DATAGRAM DUMP:
63.44.2.33:41235 -> 172.143.241.86:4949
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF
Seq: 0xF74E606
(32 more bytes of original packet)
** END OF DUMP
[**] ...more like this [**]
And the updated code:
    def snort_parse(logfile):
        header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]")
        cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]"))
        pri = Suppress("[Priority:") + integer + Suppress("]")
        date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer)
        src_ip = ip_addr + Suppress("->")
        dest_ip = ip_addr
        extra = Regex(".*")
        bnf = header + cls + pri + date + src_ip + dest_ip + extra

        def logreader(logfile):
            chunk = []
            with open(logfile) as snort_logfile:
                for line in snort_logfile:
                    if line !='\n':
                        line = line[:-1]
                        chunk.append(line)
                        continue
                    else:
                        print chunk
                        yield " ".join(chunk)
                        chunk = []

        string_to_parse = "".join(logreader(logfile).next())
        fields = bnf.parseString(string_to_parse)
        print fields
Any help, pointers, RTFMs, You're Doing It Wrongs, etc., greatly appreciated.
You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of Regex(".*") followed by a Suppress for the text where you want the match to stop.
Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to just read until the end of the line, since that is where ".*" stops without specifying multiline.
In pyparsing, this concept is implemented using SkipTo. Your header line currently has Regex(".*") sitting in front of Suppress("[**]"). Your ".*" problem gets resolved by changing that Regex(".*") to SkipTo("[**]").
Same thing for cls.
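To make the difference concrete, here is a minimal sketch that runs both forms of the header against a shortened sample line. The definition integer = Word(nums) is an assumption, since the question never shows how integer is defined, and the names greedy_header and skipto_header are only illustrative.

```python
from pyparsing import Combine, ParseException, Regex, SkipTo, Suppress, Word, nums

# Assumption: the question does not show how `integer` is defined.
integer = Word(nums)
sid = Combine(integer + ":" + integer + ":" + integer)

line = "[**] [1:486:4] ICMP Destination Unreachable [**]"

# Original form: Regex(".*") consumes the rest of the line, including the
# trailing "[**]", so the final Suppress has nothing left to match.
greedy_header = (Suppress("[**] [") + sid + Suppress("]")
                 + Regex(".*") + Suppress("[**]"))
try:
    greedy_header.parseString(line)
    greedy_ok = True
except ParseException:
    greedy_ok = False

# SkipTo form: scanning stops where the target expression would match.
skipto_header = (Suppress("[**] [") + sid + Suppress("]")
                 + SkipTo("[**]") + Suppress("[**]"))
header_fields = skipto_header.parseString(line).asList()

# SkipTo may keep whitespace that precedes the target, so strip it.
message = header_fields[1].strip()
print(greedy_ok, header_fields[0], message)
```

The same substitution should carry over to cls, with "]" as the SkipTo target.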
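One last bug: your definition of date is short by one ':' + integer. The sample timestamp 07:30:02.233350 has hours, minutes and seconds, so it should be:
    date = integer + "/" + integer + "-" + integer + ":" + integer + ":" + integer + "." + Suppress(integer)
I think those changes will be sufficient to start parsing your log data.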
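A quick way to check the date fix is to run both definitions against the timestamp from the sample log. This is only a sketch; integer = Word(nums) is again an assumption, and short_date and fixed_date are illustrative names.

```python
from pyparsing import ParseException, Suppress, Word, nums

# Assumption: `integer` is a plain run of digits.
integer = Word(nums)
timestamp = "08/03-07:30:02.233350"

# As posted: month/day-hour:minute.fraction, with no seconds field.
short_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + "." + Suppress(integer))

# With the extra ':' + integer for the seconds.
fixed_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + ":" + integer + "." + Suppress(integer))

try:
    short_date.parseString(timestamp)
    short_ok = True
except ParseException:
    short_ok = False  # the parser expected "." but found ":"

date_tokens = fixed_date.parseString(timestamp).asList()
print(short_ok, date_tokens)
```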
Here are some other style suggestions:
You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressible punctuation in one very compact and easy-to-maintain statement (expand it to add whatever other punctuation characters you like). Now I can use these characters by their symbolic names, and I find the resulting code a little easier to read.
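One way such a statement might look is sketched below. The names LBRACK, RBRACK and COLON are illustrative choices, not something taken from the question, and the rewritten pri expression is only an example of how the names could be used.

```python
from pyparsing import Suppress, Word, nums

# One compact statement for all the punctuation that should be matched
# but left out of the results. The names here are hypothetical.
LBRACK, RBRACK, COLON = map(Suppress, "[]:")

# Assumption: `integer` is a plain run of digits.
integer = Word(nums)

# The priority field from the question, rewritten with the symbolic names.
pri = LBRACK + Suppress("Priority") + COLON + integer + RBRACK

pri_tokens = pri.parseString("[Priority: 3]").asList()
print(pri_tokens)
```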
You start off header with header = Suppress("[**] [") + ... . I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" was changed to use 2 or 3 spaces, or a tab, then your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with Suppress("[**]") followed by a separate suppressed "[". I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.
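The whitespace point can be sketched by feeding both styles a header with three spaces instead of one. The input string is made up for the demonstration, and LBRACK, RBRACK, rigid and flexible are illustrative names.

```python
from pyparsing import Combine, ParseException, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")  # hypothetical symbolic names
integer = Word(nums)                  # assumption: a plain run of digits
sid = Combine(integer + ":" + integer + ":" + integer)

# Made-up input: three spaces between "[**]" and "[" instead of one.
sample = "[**]   [1:486:4]"

# Space embedded in the literal: only the exact text "[**] [" matches.
rigid = Suppress("[**] [") + sid + RBRACK
try:
    rigid.parseString(sample)
    rigid_ok = True
except ParseException:
    rigid_ok = False

# Separate expressions: pyparsing skips whitespace between them.
flexible = Suppress("[**]") + LBRACK + sid + RBRACK
flexible_tokens = flexible.parseString(sample).asList()
print(rigid_ok, flexible_tokens)
```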
Once you have your fields parsed out, start assigning results names to different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, attaching the results name "classification" to the matched text inside cls will allow you to access the classification data using fields.classification.
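As a rough sketch of results names, the snippet below names the classification and priority fields and reads them back as attributes. The name "priority" is an extra illustrative choice, and integer = Word(nums) is an assumption.

```python
from pyparsing import Optional, SkipTo, Suppress, Word, nums

integer = Word(nums)   # assumption: a plain run of digits
RBRACK = Suppress("]")

# Calling an expression with a string attaches a results name to it.
cls = Optional(Suppress("[Classification:")
               + SkipTo(RBRACK)("classification")
               + RBRACK)
pri = Suppress("[Priority:") + integer("priority") + RBRACK

fields = (cls + pri).parseString("[Classification: Misc activity] [Priority: 3]")

# Named results are available as attributes on the returned ParseResults.
classification = fields.classification.strip()
priority = fields.priority
print(classification, priority)
```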
Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is with pyparsing being unable to handle the entries, or with you being unable to send them to pyparsing in the right format. If the latter, why not write a generator that collects lines until it reaches a blank line and then yields the collected chunk?
Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.
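A generator along those lines might look like the sketch below. The name read_chunks and the in-memory sample are hypothetical. Unlike the reader in the question, it also yields the last entry when the file does not end with a blank line.

```python
import io


def read_chunks(lines):
    """Yield each blank-line-separated entry as one space-joined string."""
    chunk = []
    for line in lines:
        text = line.rstrip("\r\n")
        if text.strip():
            chunk.append(text)
        elif chunk:
            # A blank line closes the current entry.
            yield " ".join(chunk)
            chunk = []
    if chunk:
        # The last entry may not be followed by a blank line.
        yield " ".join(chunk)


# Hypothetical in-memory stand-in for an open Snort log file.
sample_log = io.StringIO(
    "[**] [1:486:4] ICMP Destination Unreachable [**]\n"
    "[Classification: Misc activity] [Priority: 3]\n"
    "\n"
    "[**] [1:486:4] ICMP Destination Unreachable [**]\n"
    "[Classification: Misc activity] [Priority: 3]\n"
)

entries = list(read_chunks(sample_log))
print(len(entries))
```

Each string in entries could then be handed to bnf.parseString one at a time, instead of joining only the first chunk as the question's code does.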