Creating a dynamic "variable layout" split() in Python
I have a script that parses IIS logs and at the moment it fetches log lines one by one using split to put IIS field values into multiple variables like this:
date, time, sitename, ip, uri_stem, whatever = log_line.split(" ")
And it works fine for the default setup. But if someone else uses a different log field layout (a different order, different log fields, or both), he would have to go and find this line in the source and modify it. This person would also have to know how to modify it so that nothing breaks, since these variables are obviously used later in the code.
How could I make this more generic, by having some kind of list containing the IIS log field layout which a user could modify (a config variable, or a dict/list at the beginning of the script) and which would later be used to hold the log line values? That is what I mean by "dynamic". I was thinking of maybe using a for-loop and a dictionary to do that, but I imagine it would have a big impact on performance compared to using split(), or wouldn't it? Does anyone have a suggestion on how this could/should be done?
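Concretely, the dict-based approach I am imagining would look something like this (the field names are just the defaults I use now; a user would edit the list at the top of the script):

```python
# Sketch of the "configurable layout" idea: the user edits FIELD_LAYOUT
# to match the field order of their IIS log.
FIELD_LAYOUT = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def parse_line(log_line):
    """Map one log line to a dict keyed by the configured field names."""
    return dict(zip(FIELD_LAYOUT, log_line.split(" ")))

record = parse_line("2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200")
# Later code would use record["ip"] instead of a bare `ip` variable.
```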
Is it even worth the trouble, or should I just make a note for anyone who uses the script on where to find the line that contains log_line.split(), how to change it, and what to pay attention to?
Thank you.
Comments (2)
If only the order of the fields may vary, it is possible to verify each line and automatically adapt the extraction of information to the detected order.
I think it would be easy to do with the help of a regex.
If not only the order but also the number and nature of the fields may vary, I think it would still be possible to do the same, provided the possible fields are known in advance.
The common condition is that the fields must have "personalities" strong enough to be easily distinguishable.
Without more precise information, nobody can go further, IMO.
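For IIS logs in W3C extended format there is actually a shortcut that avoids guessing from the values: each log file starts with a `#Fields:` directive naming the columns in the order the file uses. A minimal sketch of detecting the layout from that directive (the sample lines are illustrative):

```python
def detect_layout(lines):
    """Scan a W3C extended log for its '#Fields:' directive and return
    the column names in the order the file actually uses."""
    for line in lines:
        if line.startswith("#Fields:"):
            return line[len("#Fields:"):].split()
    return None  # no directive found; the layout cannot be detected

sample = [
    "#Software: Internet Information Services",
    "#Fields: date time s-sitename c-ip cs-uri-stem sc-status",
    "2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200",
]
layout = detect_layout(sample)
# layout holds the column names in file order
```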
Monday, 15 August 9:39 GMT+0:00
It seems there is an error in spilp.py:
it must be
with codecs.open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:
not
with open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:
The latter uses the builtin open(), which (in Python 2) does not accept the keywords in question.
Monday, 15 August 16:10 GMT+0:00
Presently, in the sample file, the fields are in this order:
.
Suppose you want to extract the values of each line in the following order:
and to assign them to the following identifiers, in the same order:
This is done with a function line_spliter().
I know, I know, what you want is the contrary: to restore the values read from a file to the order they presently have in the sample file, in case a given file uses an order different from the generic one.
But I take this only as an example, so as to leave the sample file as is. Otherwise I would need to create another file, with the values in a different order, to expose an example.
Anyway, the algorithm doesn't depend on the example. It depends on the desired order that defines the succession of values which must be obtained to make a correct assignment.
In my code, this desired order is set with the object ref_fields.
I think my code and its execution speak for themselves in making the principle understood.
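The actual spilp.py code is not reproduced here, but the principle described above (a ref_fields object defining the desired order, and a line_spliter() that maps a file's detected order onto it) can be sketched roughly as follows; the names ref_fields and line_spliter come from the answer, everything else is an assumption:

```python
# ref_fields: the desired order in which values must be assigned.
ref_fields = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def make_line_spliter(detected_order):
    """Build a splitter that reads values laid out in detected_order
    and yields them rearranged into ref_fields order."""
    indices = [detected_order.index(name) for name in ref_fields]
    def line_spliter(line):
        values = line.split(" ")
        return [values[i] for i in indices]
    return line_spliter

# A file whose columns are shuffled relative to ref_fields:
spliter = make_line_spliter(["time", "date", "ip", "sitename", "status", "uri_stem"])
date, time_, sitename, ip, uri_stem, status = spliter(
    "09:39:00 2011-08-15 192.168.0.1 W3SVC1 200 /index.html")
# date == '2011-08-15', ip == '192.168.0.1'
```

The index mapping is computed once per file, so per-line work stays close to a plain split().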
Result:
This code applies only to the case where the fields in a file are shuffled but are the same in number as the normal, known list of fields.
Other cases may occur, for example fewer values in a file than there are known, expected fields. If you need more help with these other cases, explain which ones may happen and I'll try to adapt the code.
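One way the "fewer values than expected fields" case could be handled (an assumption, not part of the answer's code) is to pad missing trailing fields with None so downstream code can test for absence:

```python
from itertools import zip_longest

expected = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def parse_partial(line):
    """Pair values with expected field names, padding missing trailing
    fields with None."""
    return dict(zip_longest(expected, line.split(" ")))

rec = parse_partial("2011-08-15 09:39:00 W3SVC1 192.168.0.1")
# rec["uri_stem"] and rec["status"] are None for this short line
```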
.
I think I will have many remarks to make on the code I quickly read in spilp.py. I'll write them when I have time.
Changing the log line layout is a pretty big deal, but something that gets done from time to time because new items are added or existing items deleted. Rarely does someone simply shuffle around existing items just for the heck of it.
These kinds of changes do not happen every day; they should be pretty rare. And when you have items within the log line added or deleted, you are changing the code anyway -- after all, the new fields have to be processed in some way, and the code that processes any deleted fields has to be removed.
Yes, writing resilient code is a Good Thing. Defining a schema mapping field names to their positions in the log line may seem like a great idea, since it permits reshuffling and adding without digging into the one split line. But is it worth it for schema changes that happen twice a year? And is it worth it to avoid changing one line when so many other lines will have to be changed anyway? That is for you to decide.
That said, if you want to do this, consider using collections.namedtuple to process your line into a dict-like object. The specification of the names can be done in a configuration-area of your code. You will take a performance hit in doing so, so weigh that against the gain in flexibility....
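A minimal sketch of the namedtuple approach suggested above (the field list here is illustrative, standing in for the configuration area):

```python
from collections import namedtuple

# Configuration area: edit this to match the log's field layout.
LOG_FIELDS = "date time sitename ip uri_stem status"

LogLine = namedtuple("LogLine", LOG_FIELDS)

def parse(line):
    """Split a log line and return a named, immutable record."""
    return LogLine(*line.split(" "))

rec = parse("2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200")
# Access by name survives layout reshuffles: rec.ip, rec.status
```

If a true dict is needed later, `rec._asdict()` gives one; field access by attribute (rec.ip) is what insulates the rest of the code from layout changes.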