Processing logs with sed/awk and regular expressions
I have 1000s of log files generated by a very verbose PHP script. The general structure is as follows
###Unknown no of lines, which I want to ignore###
=================================================
$insert_vars['cdr_pkey']=17568
$id<TAB>$g1<TAB>$i1<TAB>$rating1<TAB>$g2<TAB>$i2<TAB>$rating2 #<TAB>more $gX,$iX,$ratingX
#numerical values of $id $g1 $i1 etc. separated by tab
#numerical values of ---""---
#I do not know how many lines will be there (unique column is $id)
=================================================
###Unknown no of lines, which I want to ignore###
I have to process these log files and create an Excel sheet (I am thinking CSV format) and report the data back. I am really bad at Excel, but I thought of outputting something like:
cdr_pkey<TAB>id<TAB>g1<TAB>i1<TAB>rating1<TAB>g2<TAB>rating2 #and so on
17568<TAB>1349<TAB>0.0004532<TAB>0.01320<TAB>2.014E-4<TAB>...#rest of numerical values
17568<TAB>1364<TAB>...#values for id=1364
17568<TAB>1321<TAB>...#values for id=1321
...
17569<TAB>1048<TAB>...#values for id=1048
17569<TAB>1426<TAB>...#values for id=1426
...
...
So my cdr_pkey is the unique column in the sheet, and for each $cdr_pkey I have multiple $id values, each with its own set of $g1, $i1, $rating1...
After testing, such a format can be read by Excel. Now I just want to extend it to all those 1000s of files.
I am just not sure how to proceed further. What's the next step?
Comments (2)
The following bash script does something that might be related to what you want. It is parameterized by what you meant when you said <TAB>. I assume you mean the ASCII tab character, but if your logs are so verbose that they spell out <TAB>, you will need to modify the variable $WHAT_DID_YOU_MEAN_BY_TAB accordingly. Note that there is very little about this script that does The Right Thing™; it reads the entire file into a string variable, which might not even be possible depending on how big your log files are. On the up side, the script could easily be modified to make two passes instead, if you think that's better.
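(A rough sketch of the kind of extraction being described, purely for illustration: it is not the original script, it leans on awk inside a small bash wrapper rather than slurping the whole file into a string, and it assumes a literal ASCII tab plus the ===== markers, the $insert_vars['cdr_pkey'] line, and the $id header row exactly as shown in the question. File and variable names here are made up.)

#!/usr/bin/env bash
# Illustrative sketch only: emit one TSV row per data line, prefixed with the
# cdr_pkey of the block it came from. Usage (hypothetical): ./extract_one_log.sh logfile
WHAT_DID_YOU_MEAN_BY_TAB=$'\t'   # assuming a literal ASCII tab; adjust if the logs spell out <TAB>

awk -v OFS="$WHAT_DID_YOU_MEAN_BY_TAB" '
    /^=+$/                { inblock = !inblock; next }               # the ===== lines open and close a block
    inblock && /cdr_pkey/ { split($0, kv, "="); pkey = kv[2]; next } # the $insert_vars line, e.g. ...=17568
    inblock && /^\$id/    { next }                                   # skip the $id<TAB>$g1<TAB>... header row
    inblock && NF         { print pkey, $0 }                         # prefix each data row with its cdr_pkey
' "$1"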
The following find command is an example use, but your case will depend on how your logs are organized.

find . LOG_PATTERN -exec THIS_SCRIPT '{}' \;
Lastly, I have ignored the issue of putting the CSV headers on the output. This is easily done out-of-band.
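For example (again just an illustration, reusing the LOG_PATTERN and THIS_SCRIPT placeholders above and a made-up report.csv name), the header row could be written once and the per-file output appended after it:

printf 'cdr_pkey\tid\tg1\ti1\trating1\n' > report.csv   # extend with g2, i2, rating2, ... as needed
find . LOG_PATTERN -exec THIS_SCRIPT '{}' \; >> report.csv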
(Edit: updated the script to reflect discussion in the comments.)
EDIT: James tells me that changing the sed in the last echo from ... 1d ... to ... 1,2 ... and dropping the grep -v 'id' should do the trick. Confirmed that it works, so I am changing it below. Thanks again to James Wilcox.
Based on @James's script, this is what I came up with. I just piped the final echo to grep -v 'id'. Thanks again, James Wilcox.
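To make the edit above concrete (a hedged illustration only: the real sed expression is inside James's script, which is not quoted here, and $block is a made-up variable name), deleting the first two lines of each echoed block, presumably the cdr_pkey assignment and the $id header row, removes the need for the extra grep:

echo "$block" | sed '1d' | grep -v 'id'   # before: drop line 1, then filter the header row by pattern
echo "$block" | sed '1,2d'                # after: drop lines 1 and 2 in one step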