Parsing complex debug logs with regex/grep/awk/sed

I have debug logs that are GB in size and contain lots of extraneous data. A single log entry can be 1,000,000+ lines long; some parts are indented, some are not, and there is very little consistency except for the timestamp at the start of each entry. Each new entry starts with a timestamp of the form ^202[0-9]/[0-9]{2}/[0-9]{2} blah blah, so it is easily identifiable, but many, many lines after it can belong to it. I've been using Python to locate strings of text, then move up to find the parent entry they belong to and down to the end of that entry, where the next ^202[0-9]/[0-9]{2}/[0-9]{2} timestamp appears. Unfortunately that is not nearly performant enough to make this a painless process. I'm now trying to get grep to do the same with regex, since grep seems to be in a different universe in terms of speed. I also run into Python version differences (2 vs 3) on the machines I'm working on, which is just a pain.

This is what I have so far for grep. It works on small test cases but not on large files, so there are obviously some performance issues with it. How can I resolve this? Perhaps there's a good way to do this with awk?

grep -E "(?i)^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}[\s\S]+00:00:00:fc:77:00[\s\S]+?(?=^20[0-9]{2}\/[0-9]{2}\/[0-9]{2}|\Z)" the key string I'm looking for is 00:00:00:fc:77:00

Sample:

2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22 
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
... 
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency 
...

If any of these entries contains my search string, I want the whole piece between the timestamps, all the many thousands of lines.

Answer from 薄荷→糖丶微凉 (2025-01-19 14:21:43):

Assumptions:

  • only timestamp lines start with ^YYYY/MM/DD
  • a timestamp line starts with a string of the format YYYY/MM/DD HH:MM:SS.sss {<thread_name>} and this string is unique within the file
  • search strings do not contain embedded new lines
  • search strings are guaranteed not to be broken by newlines in the log file
  • a single log entry (OP: can be 1,000,000+ lines long) may be too large to fit into memory
  • need to search for multiple patterns

Setup:

$ cat search_strings
00:00:00:fc:77:00
and this line of text

$ cat log.txt
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:45.408 {Engine-Worker-16} <radiusItem.request-dump> Request packet dump:
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...

NOTE: log.txt lines of interest start with ^MATCH

One awk idea that requires two passes through the log file:

awk '
FNR==NR        { strings[$0]; next }              # first file (search_strings): load the search strings

FNR==1         { pass++ }                         # log.txt is listed twice; count which pass we are on

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3   # capture date/time/thread from a timestamp line
                                           print_header=1 
                                         }

pass==1        { for (string in strings)
                     if (match($0,string)) {           # search for our strings in current line and if found ...
                        dttlist[dtt]                   # save current date/time/thread
                        next
                     }
               }

pass==2 && 
(dtt in dttlist) { if (print_header) {
                      print "################# matching block:"
                      print_header=0
                   }
                   print
                 }
' search_strings log.txt log.txt
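
One tweak worth trying on top of this: if the lines in search_strings are literal strings rather than regexes, swapping match($0,string) for index($0,string) in the pass==1 block does a plain substring test and skips regex processing on every line, which can add up at GB scale. A sketch of just that rule:

pass==1        { for (string in strings)
                     if (index($0,string)) {          # literal substring test instead of a regex match
                        dttlist[dtt]
                        next
                     }
               }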

Assuming memory usage is not an issue, another awk idea requiring a single pass through the log file:

awk '

function print_lines() {

    for (i=0; i<=lineno; i++)          # flush the header (lines[0]) plus any buffered lines
        if (i in lines)                # guard against an empty buffer
            print lines[i]

    delete lines
    lineno=0
}

FNR==NR         { strings[$0]; next }             # first file (search_strings): load the search strings

/^20[0-9][0-9]\/[0-1][0-9]\/[0-3][0-9] / { dtt = $1 FS $2 FS $3 }   # capture date/time/thread from a timestamp line

dtt != prev_dtt { delete lines                  # new log entry: reset the buffer
                  lineno=0
                  lines[0]="################# matching block:"

                  printme=0                     # disable printing of lines to stdout;
                                                # instead they will be saved to the lines[] array
                  prev_dtt=dtt
                }

! printme      { for (string in strings)
                     if (match($0,string)) { 
                        print_lines()          # flush any lines in the lines[] array and ...
                        printme=1              # set flag to print new lines to stdout
                        break
                     }
                 if (! printme)  {             # if not printing lines to stdout then ...
                    lines[++lineno]=$0         # save the current line in the lines[] array
                 }
               }
printme                                        # printme==1 => print current line to stdout
' search_strings log.txt

Both of these generate:

################# matching block:
2022/01/28 17:58:45.408 {Engine-Worker-08} <radiusItem.request-dump> Request packet dump:
  Type=1, Ident=160, Len=54, Auth=7D 12 89 48 19 85 00 00 00 00 00 00 12 0C CC 22
MATCH  should match on this: 00:00:00:fc:77:00
...
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
################# matching block:
2022/01/28 17:58:46.127 {TcpEngine-3} <tcp-service> Accept failure: Invalid Radius/TLS client 1.1.1.1, connection closed
MATCH  should match on this: 00:00:00:fc:77:00
################# matching block:
2022/01/28 17:58:48.604 {Engine-Worker-60} [acct:callAcctBkgFlow] <engine.item.setup> Call method ==> acct:readAcctPropFile
...
MATCH  should also match on this: and this line of text :if we've got the logic right
hundreds of thousands of lines of nonsense that might have my search string in it, with little to no consistency
...
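
Independent of which script is used, forcing the byte-oriented C locale is often a cheap extra speedup for awk and grep on large ASCII logs, since it sidesteps multibyte locale handling. A sketch, assuming the log is effectively ASCII, with extract.awk as a hypothetical file holding the single-pass script above (the two-pass variant would list log.txt twice):

LC_ALL=C awk -f extract.awk search_strings log.txt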