Converting an AWK regex into a Python script

Published 2024-11-07 03:18:55


Good morning all,
Wondering if you could please help me with the following query:-
I have just started learning Python last weekend after a colleague of mine showed me how to dramatically cut the time a Bash script takes to execute by re-writing it in Python. I was amazed at how fast it ran. I would now like to do the same thing with another script I have.

This other script reads a log file and using AWK it filters certain fields from the log and writes them to a new file. See below the regex the script is executing. I would like to re-write this regex in Python as my script is currently taking about 1 hour to execute on a log file with about 100,000 lines. I would like to cut this time down as much as possible.

cat logs/pdu_log_fe.log | awk -F\- '{print $1,$NF}' | awk -F\. '{print $1,$NF}' | awk '{print $1,$4,$5}' \
  | sort | uniq | while read service command status; do
      echo "Service: $service, Command: $command, Status: $status, Occurrences: `grep $service logs/pdu_log_fe.log | grep $command | grep $status | wc -l | awk '{ print $1 }'`" >> logs/pdu_log_fe_clean.log
    done

This AWK command gets lines which look like this:-

2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >

And outputs lines like this:-

CC_SMS_SERVICE_51408 submit_resp: 0
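The pipeline above is slow mostly because of the inner `grep ... | wc -l` calls: each unique service/command/status triple re-scans the whole log, so the work grows roughly quadratically with the number of lines. A single pass that counts occurrences in a dictionary avoids that entirely. The sketch below is one possible rewrite, not the asker's actual script; the token positions (5, 13 and 20 after a whitespace split) and the trimming of the trailing `_656` are assumptions read off the sample line above, so they would need adjusting if the real lines differ:

```python
from collections import Counter

SAMPLE = ("2011-05-16 09:46:22,361 [Thread-4847133] PDU D "
          "<G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-"
          "VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - "
          "2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 "
          "Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >")

def parse(line):
    """Extract (service, command, status) from one log line by token position."""
    tokens = line.split()
    # tokens[5] looks like '<G_CC_SMS_SERVICE_51408_656.O_':
    # drop the '<G_' prefix, keep what precedes the dot, trim the trailing '_656'
    service = tokens[5][3:].split('.')[0].rsplit('_', 1)[0]
    command = tokens[13].lstrip('(')   # '(submit_resp:' -> 'submit_resp:'
    status = tokens[20]                # the value following 'Status:'
    return service, command, status

def summarise(lines):
    """Count each (service, command, status) triple in a single pass."""
    counts = Counter(parse(line) for line in lines)
    return ["Service: %s, Command: %s, Status: %s, Occurrences: %d"
            % (svc, cmd, st, n)
            for (svc, cmd, st), n in sorted(counts.items())]
```

Reading the file once and writing `summarise(open('logs/pdu_log_fe.log'))` out replaces both the AWK pipeline and the `while read` loop, so a 100,000-line log should take seconds rather than an hour.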

I have tried writing the Python script myself but I am getting stuck writing the regex. So far I have the following:-

#!/usr/bin/python

# Import RegEx module
import re as regex
# Log file to work on
filetoread = open('/tmp/pdu_log.log', "r")
# File to write output to
filetowrite = open('/tmp/pdu_log_clean.log', "w")
# Perform filtering in the log file
linetoread = filetoread.readlines()
for line in linetoread:
    filter0 = regex.sub(r"<G_","",line)
    filter1 = regex.sub(r"\."," ",filter0)
# Write new log file
    filetowrite.write(filter1)
filetowrite.close()
# Read new log and get required fields from it
filtered_log = open('/tmp/pdu_log_clean.log', "r")
filtered_line = filtered_log.readlines()
for line in filtered_line:
    token = line.split(" ")
    print token[0], token[1], token[5], token[13], token[20]
print "Done"

Ugly I know but please bear in mind that I have just started learning Python two days ago.

I have been looking on this group and on the Internet for snippets of code that I could use, but so far what I have found does not fit my needs or is too complicated (at least for me).

Any suggestion, advice you can give me on how to accomplish this task will be greatly appreciated.

On another note, can you also recommend a good no-nonsense book to learn Python? I have read the book “A Byte of Python” by Swaroop C H (great introductory book!) and I am now reading “Dive into Python” by Mark Pilgrim. I am looking for a book that explains things in simple terms and goes straight to the point (similar to how “A Byte of Python” was written).

Thanks in advance

Kind regards,

Junior

=====Answer to Eli who commented below=====

My apologies guys, I tried commenting on Eli's answer but my comment is too long and it won't save. I also tried answering my own post, but as I am a new user I cannot answer for another 8 hours! So my only option is to add an edit to my post :)

Anyways, in response to Eli's comment:-

Ok, let's see. My aim is to filter out several fields from a log file and write them to a new log file. The current log file, as I mentioned previously, has thousands of lines like this:-

2011-05-16 09:46:22,361 [Thread-4847133] PDU D

All the lines in the log file are similar and they all have the same length (same number of fields). Most of the fields are separated by spaces except for a couple of them which I am processing with AWK (removing "

I hope this is clearer now

Regards,

Junior


Answered by 冬天旳寂寞 on 2024-11-14 03:18:55


Since these lines are very structured, for simplicity (and speed), I would not go for a regex at all. Here's an example extracting your first piece of data:

>>> line = "2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_ CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX - 2011-05-16 09:46:22 - OUT - (submit_resp: (pdu: L: 53 ID: 80000004 Status: 0 SN: 25866) 98053090-7f90-11e0-a2da-00238bce423b (opt: ) ) >"
>>> istart = line.find('<G_')
>>> iend = line.find('.', istart)
>>> line[istart+3:iend]
'CC_SMS_SERVICE_51408_656'

Other fields can be extracted similarly, depending on the exact structure of all possible lines. It's hard to understand what your AWK does exactly and how it applies to the example you provided. It would be easier if you could describe the structure of your data lines and what exactly you need to extract.

For example, splitting the line by whitespace (the default for split) you get:

>>> line.split()
['2011-05-16', '09:46:22,361', '[Thread-4847133]', 'PDU', 'D', '<G_CC_SMS_SERVICE_51408_656.O_', 'CC_SMS_SERVICE_51408_656-ServerThread-VASPSessionThread-7ee35fb0-7e87-11e0-a2da-00238bce423b-TRX', '-', '2011-05-16', '09:46:22', '-', 'OUT', '-', '(submit_resp:', '(pdu:', 'L:', '53', 'ID:', '80000004', 'Status:', '0', 'SN:', '25866)', '98053090-7f90-11e0-a2da-00238bce423b', '(opt:', ')', ')', '>']

Now you're pretty much free to extract whichever fields you need from here, as long as (as you say) the format is very fixed and it's always the same fields. So:

>>> line.split()[13]
'(submit_resp:'

Cleaning up a bit:

>>> line.split()[13].lstrip('(').rstrip(':')
'submit_resp'

As you can see, the possibilities are limitless. I suggest you get familiar with Python's string processing capabilities before you engorge yourself in regexes. Regexes are useful, but they're not the only tool for the job. Often, solutions based on alternative string processing techniques are faster and easier to understand. You can always supplement them with regexes, of course.
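As an illustration of that last point, the `find()`-based slice shown at the top of this answer has a one-line regex counterpart; whether it is clearer is a matter of taste. The pattern below assumes the service name is always introduced by `<G_` and terminated by the next dot:

```python
import re

# The relevant prefix of the sample line above
line = "2011-05-16 09:46:22,361 [Thread-4847133] PDU D <G_CC_SMS_SERVICE_51408_656.O_"

# Capture everything between '<G_' and the next '.' -- equivalent to
# the find()/slice version, expressed as a single pattern
m = re.search(r'<G_([^.]+)\.', line)
service = m.group(1) if m else None
```

`service` is `'CC_SMS_SERVICE_51408_656'`, the same string the slice produced.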

P.S. For books/resources on learning Python - there are many SO questions on this. Start here and browse.
