Parsing complex debug logs with regex/grep/awk/sed
I have debug logs that are GB in size and contain lots of extraneous data. Single log entries can be 1,000,000+ lines long, some parts with indents, some without, and there is very little consistency except for the timestamp at the start of each entry. Each new entry starts with a timestamp, ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah, so it is easily identifiable, but it can have many, many lines after it that all belong to it. I've been using Python to locate strings of text, then move up to find the parent entry they belong to, and down to the end of the entry, where the next instance of ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah is located. Unfortunately, that is not nearly performant enough to make this a painless process. I'm now trying to get grep to do the same with a regex, since grep seems to be in a different universe in terms of speed. I also run into the issue of Python version differences (2 vs. 3) on the machines I'm working on, which is just a pain.
This is what I have so far for grep. It works in small test cases but not on large files, so there are obviously some performance issues with it. How can I resolve this? Perhaps there's a good way to do this with awk?
grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)"
The key string I'm looking for is 00:00:00:fc:77:00.
Sample:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
If any of these entries contain my search string, I want the whole piece between the timestamps, all the many thousands of lines.
Answer:
Assumptions:
- each new log entry starts with a line matching ^YYYY/MM/DD
- the first line of each entry has the form YYYY/MM/DD HH:MM:SS.sss {<thread_name>} ..., and this string is unique within the file
Setup: a small test file, log.txt. NOTE: the lines of interest start with ^MATCH so the entries that should be printed are easy to spot.
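A minimal stand-in log.txt in that spirit (the MATCH markers and filler lines are illustrative; the headers are reused from the question's sample):

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
MATCH filler line containing the search string 00:00:00:fc:77:00
filler line with no match
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
filler line with no match
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
MATCH filler line containing the search string 00:00:00:fc:77:00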
One awk idea that requires two passes through the log file:
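A sketch of such a two-pass script, relying on the stated uniqueness of the header lines (the search string is passed in as str, and log.txt is named twice so awk reads it twice; pass 1 records the header of every entry that contains the string, pass 2 reprints exactly those entries):

awk -v str='00:00:00:fc:77:00' '
    # pass 1 (FNR==NR): track the current entry header; if the search string
    # shows up anywhere in the entry, mark that header for keeping
    FNR == NR { if (/^20[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] /) hdr = $0
                if (index($0, str)) keep[hdr] = 1
                next
              }
    # pass 2: turn printing on at kept headers, off at all other headers
    /^20[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] / { p = ($0 in keep) }
    p
' log.txt log.txt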
Assuming memory usage is not an issue, another awk idea requiring a single pass through the log file:
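A sketch of the single-pass version, which buffers the current entry in memory and flushes it only if the search string was seen (with 1,000,000+ line entries that buffer can get large, hence the memory caveat):

awk -v str='00:00:00:fc:77:00' '
    # print the buffered entry only if it contained the search string
    function prt() { if (found) printf "%s", buf; buf = ""; found = 0 }

    /^20[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9] / { prt() }   # a new entry begins
    { buf = buf $0 ORS                                   # buffer current line
      if (index($0, str)) found = 1 }
    END { prt() }                                        # flush the last entry
' log.txt

Both of these generate (against the illustrative log.txt above):

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
MATCH filler line containing the search string 00:00:00:fc:77:00
filler line with no match
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
MATCH filler line containing the search string 00:00:00:fc:77:00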