Unix shell script to parse logs using grep

Posted 2024-10-02 07:00:49


Contents of events<xyz>.log:

<log>  
 <time>09:00:30</time>  
 <entry1>abcd</entry1>  
 <entry2>abcd</entry2>  
 <id>john</id>  
</log>  
<log>
 <time>09:00:35</time>  
 <entry1>abcd</entry1>  
 <entry2>abcd</entry2>  
 <id>steve</id>  
</log>  
<log>  
 <time>09:00:40</time>  
 <entry1>abcd</entry1>  
 <entry2>abcd</entry2>  
 <id>john</id>  
</log>  

I want to extract the <time>, <entry1> and <entry2> tags of every <log> entry whose <id> is 'john' into a file. I want to do this in a shell script that looks at all *.log files in a directory. The output should be similar to the following.

Contents of a.out:

<time>09:00:30</time>   
<entry1>abcd</entry1>  
<entry2>abcd</entry2>

<time>09:00:40</time>  
<entry1>abcd</entry1>  
<entry2>abcd</entry2>  

I am new to shell scripting, but I have tried a few basic commands to at least look at the logs:

$ grep -B 3 -in '<id>john</id>' * > /tmp/a.out

The command above gives me the 3 lines preceding each <id> tag matching john, like this:

...   
events111.log-100- <time>09:00:40</time>  
events111.log-101- <entry1>abcd</entry1>  
events111.log-102- <entry2>abcd</entry2>  
events111.log-103- <id>john</id>  
....  
events112.log-200- <time>06:56:03</time>  
events112.log-201- <entry1>abcd</entry1>  
events112.log-202- <entry2>abcd</entry2>  
events112.log-203- <id>john</id>  

This is fine, but there is a problem: 3 lines of context won't work every time, since there could be more tags in between, so some parsing logic is needed to pick out the text from <time> to </id>.

I would really appreciate some help around formulating a script for this.

Thanks!
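For reference, the parsing logic asked for here can be sketched with standard awk, treating everything between <log> and </log> as one block. This is a sketch, not a tested production script, and it assumes each tag sits on its own line as in the samples above (the `extract_john` helper name is illustrative):

```shell
# Sketch: buffer the <time>/<entry1>/<entry2> lines of each <log> block
# and print the buffer only when the block's <id> is john.
extract_john() {
  awk '
    /<log>/   { buf = ""; keep = 0 }              # start a new block
    /<time>|<entry1>|<entry2>/ {
        line = $0; sub(/^[ \t]*/, "", line)       # drop leading indentation
        buf = buf line "\n"
    }
    /<id>john<\/id>/ { keep = 1 }                 # this block matches
    /<\/log>/ { if (keep) print buf }             # print block + blank line
  ' "$@"
}
# e.g.: extract_john *.log > /tmp/a.out
```

Because `print buf` emits the buffered lines plus one extra newline, each matching block comes out separated by a blank line, as in the desired a.out.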


一袭水袖舞倾城 2024-10-09 07:00:49

Have you considered using an XML-aware grep tool such as XMLStarlet to pick the pieces out of these log files? It would be much cleaner.
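One complication: each file holds a sequence of <log> fragments with no single root element, so it is not well-formed XML on its own. A sketch, assuming the `xmlstarlet` binary from XMLStarlet is installed (the `john_entries` helper name is illustrative), wraps the stream in a root element first:

```shell
# Sketch: make the fragment stream well-formed by adding a root element,
# then copy the children of every <log> whose <id> is john.
john_entries() {
  { echo '<root>'; cat "$@"; echo '</root>'; } |
    xmlstarlet sel -t -m '//log[id="john"]' \
        -c time -n -c entry1 -n -c entry2 -n -n
}
# e.g.: john_entries *.log > /tmp/a.out
```

The `-m` template matches each qualifying <log>, `-c` copies a child node, and `-n` emits a newline (the final one produces the blank separator line).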

太阳男子 2024-10-09 07:00:49

A shell script is not really the right tool for this job; you really need a parser. Here's one in Python for a single file. You could throw a loop around this and do an entire directory of log files.

#!/usr/bin/env python
import sys
from bs4 import BeautifulSoup  # package "beautifulsoup4"; the old v3 import was "BeautifulSoup"

with open(sys.argv[1]) as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for log in soup.find_all('log'):
    if log.id.get_text(strip=True) == "john":
        print(log.entry1)
        print(log.entry2)

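BeautifulSoup tolerates the files having no single root element; with only the standard library you have to add one yourself. A sketch of the same idea using xml.etree.ElementTree (the `john_entries` function name is illustrative), with the directory loop indicated in a comment:

```python
import glob
import xml.etree.ElementTree as ET

def john_entries(path, who="john"):
    """Yield <time>/<entry1>/<entry2> lines of <log> blocks whose <id> is `who`."""
    with open(path) as f:
        # the files are XML fragments with no single root, so add one
        root = ET.fromstring("<root>" + f.read() + "</root>")
    for log in root.findall("log"):
        if (log.findtext("id") or "").strip() == who:
            for tag in ("time", "entry1", "entry2"):
                el = log.find(tag)
                if el is not None:
                    yield "<%s>%s</%s>" % (tag, (el.text or "").strip(), tag)
            yield ""  # blank separator line between blocks

# e.g., over a whole directory:
#   for name in sorted(glob.glob("*.log")):
#       for line in john_entries(name):
#           print(line)
```
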
网名女生简单气质 2024-10-09 07:00:49

has() { printf '%s\n' "$line" | grep -q "$1"; }
while IFS= read -r line; do
  has '/log' && echo
  { has time || has entry1 || has entry2; } && printf '%s\n' "$line"
done < events.log

prints

<time>09:00:30</time>
<entry1>abcd</entry1>
<entry2>abcd</entry2>

<log> <time>09:00:35</time>
<entry1>abcd</entry1>
<entry2>abcd</entry2>

<time>09:00:40</time>
<entry1>abcd</entry1>
<entry2>abcd</entry2>

You may or may not want to suppress that "<log>" in the "time" line.
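If you do want to suppress it, one sketch is a final sed pass over the loop's output (the `strip_log_tag` helper name is illustrative):

```shell
# Strip a leading "<log>" tag, and any spaces around it, from each line.
strip_log_tag() { sed 's/^ *<log> *//'; }
# e.g.: ...parsing loop... | strip_log_tag
```
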

殊姿 2024-10-09 07:00:49


For others still looking for a shell script to find specific strings in log files, locally or remotely, I have written this shell script:

https://github.com/ijimako/logs_extractor

Cheers,
