awk: parsing unique IP addresses from a mail log
Yesterday I asked a question here about a oneliner and mjschultz gave me an answer that I instantly fell in love with :) Awk just destroyed the task at hand, parsing a large logfile (500+ MB) in a matter of seconds. Now I'm trying to port my other oneliners to awk.
This is the one in question:
grep "pop3\[" maillog | grep "User logged in" |
egrep -o '([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}' | sort -u
I need the list of all unique IP addresses using pop3 to connect to the mail server.
This is an example log entry:
Nov 15 00:49:21 hostname pop3[19418]: login: [10.10.10.10] username plaintext
User logged in
So I find all the lines containing "pop3" and I parse them for the "User logged in" part. Next I use egrep and a regex to match IP addresses, and I use sort to filter out the duplicate addresses.
This is what I have so far for my awk version:
awk '/pop3\[.*User logged in/ {ip[$7]=0}
     END {for (address in ip) print address}' maillog
This works perfectly, but as always not all log entries are identical; for example, sometimes the IP gets moved to the 8th field, like here:
Nov 15 10:42:40 hostname pop3[2232]: login: hostname.domain.com [20.20.20.20]
username plaintext User logged in
What would be the best way to catch those entries with awk as well?
As always thanks for all the great responses in advance, you've taught me so much already :)
3 Answers
AWK code: just match your IP format ... but be careful that nothing else in the log matches it ...
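The answer's code block was lost when the page was scraped; the following is only a sketch of the "match the IP format" approach it describes, written with POSIX awk's match()/RSTART/RLENGTH (not the original author's code):

awk '/pop3\[/ && /User logged in/ {
    # grab the first bracketed IPv4-looking token on the line
    if (match($0, /\[[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\]/)) {
        ip = substr($0, RSTART + 1, RLENGTH - 2)  # drop the [ ] brackets
        seen[ip] = 1                              # array keys dedupe for free
    }
}
END { for (ip in seen) print ip }' maillog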
Running at ideone.
That looks more like Perl territory than Awk to me:
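The Perl script that followed was also lost in extraction; here is a sketch of what it plausibly looked like (my reconstruction, not the answerer's code, run as perl script.pl maillog). Note the plain string sort, which is what the next sentence is about:

#!/usr/bin/perl
use strict;
use warnings;

# Collect unique bracketed IPv4 addresses from pop3 "User logged in" lines.
my %ips;
while (<>) {
    next unless /pop3\[/ && /User logged in/;
    $ips{$1} = 1 if /\[(\d{1,3}(?:\.\d{1,3}){3})\]/;
}
print "$_\n" for sort keys %ips;   # plain string sort: alphabetic, not numeric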
The sort is less than perfect - being alphabetic rather than numeric (so 192.1.168.10 will appear before 9.25.13.26). That can be fixed, of course.
After seeing and trying these approaches I got a new idea.
belisarius's code does what I asked for, but since it has to run the regex match on every line it isn't the fastest, and speed is what I'm after.
So I came up with this: as you can see, the "problematic" log lines have an extra field, making them 13 fields long instead of the normal 12. I simply drop the extra field, which gives me the correct list of IP addresses, and then I use awk again to delete all the duplicate entries (sketch below):
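The original code block was lost along with the Ideone link, so this is only a reconstruction of the approach just described, picking the right field by field count rather than physically deleting the extra one:

awk '/pop3\[/ && /User logged in/ {
    # 13 fields means the extra hostname pushed the IP from $7 to $8
    ip = (NF == 13) ? $8 : $7
    print ip                      # bracketed form, e.g. [10.10.10.10]
}' maillog | awk '!seen[$0]++'    # second awk removes duplicate entries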
Ideone link if you want to see the code in action