Creating a dynamic "variable layout" split() in Python
I have a script that parses IIS logs and at the moment it fetches log lines one by one using split to put IIS field values into multiple variables like this:
date, time, sitename, ip, uri_stem, whatever = log_line.split(" ")
And it works fine for the default setup. But if someone else uses a different log field layout (a different order, different log fields, or both), he would have to go and find this line in the source and modify it. This person would also have to know how to modify it so that nothing breaks, since these variables are obviously used later in the code.
How could I make this more generic, by having some kind of list containing the IIS log field layout which a user could modify (a config variable, or a dict/list at the beginning of the script) and which would later be used to hold the log line values? That is what I mean by "dynamic". I was thinking of maybe using a for-loop and a dictionary to do that, but I imagine it would have a big impact on performance compared to using split(), or wouldn't it? Does anyone have a suggestion on how this could/should be done?
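Concretely, the dict-based approach I am imagining would look something like this (the field names are just the defaults I use now; a user would edit the list at the top of the script):

```python
# Sketch of the "configurable layout" idea: the user edits FIELD_LAYOUT
# to match the field order of their IIS log.
FIELD_LAYOUT = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def parse_line(log_line):
    """Map one log line to a dict keyed by the configured field names."""
    return dict(zip(FIELD_LAYOUT, log_line.split(" ")))

record = parse_line("2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200")
# Later code would use record["ip"] instead of a bare `ip` variable.
```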
Is it even worth the trouble, or should I just make a note for anyone who uses the script on where to find the line that contains log_line.split(), how to change it, and what to pay attention to?
Thank you.
Comments (2)
If only the order of the fields may vary, it is possible to verify each line and automatically adapt the extraction of information to the detected order.
I think it would be easy to do with the help of a regex.
If not only the order but also the number and nature of the fields may vary, I think it would still be possible to do the same, provided the possible fields are known in advance.
The common condition is that the fields must have "personalities" strong enough to be easily distinguishable.
Without more precise information, nobody can go further, IMO.
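For IIS logs in W3C extended format there is actually a shortcut that avoids guessing from the values: each log file starts with a `#Fields:` directive naming the columns in the order the file uses. A minimal sketch of detecting the layout from that directive (the sample lines are illustrative):

```python
def detect_layout(lines):
    """Scan a W3C extended log for its '#Fields:' directive and return
    the column names in the order the file actually uses."""
    for line in lines:
        if line.startswith("#Fields:"):
            return line[len("#Fields:"):].split()
    return None  # no directive found; the layout cannot be detected

sample = [
    "#Software: Internet Information Services",
    "#Fields: date time s-sitename c-ip cs-uri-stem sc-status",
    "2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200",
]
layout = detect_layout(sample)
# layout holds the column names in file order
```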
Monday, 15 August 9:39 GMT+0:00
It seems there is an error in spilp.py:
it must be
with codecs.open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:
not
with open(file_path, 'r', encoding='utf-8', errors='ignore') as log_lines:
The latter uses the builtin open(), which (in Python 2) does not accept the keywords in question.
Monday, 15 August 16:10 GMT+0:00
Presently, in the sample file, the fields are in this order:
.
Suppose you want to extract the values of each line in the following order:
and to assign them to the following identifiers, in the same order:
This is done with a function line_spliter().
I know, I know, what you want is the contrary: to restore the values read from a file to the order they presently have in the sample file, in case a given file uses an order different from the generic one.
But I take this only as an example, so as to leave the sample file as is. Otherwise I would need to create another file, with the values in a different order, to expose an example.
Anyway, the algorithm doesn't depend on the example. It depends on the desired order that defines the succession of values which must be obtained to make a correct assignment.
In my code, this desired order is set with the object ref_fields.
I think my code and its execution speak for themselves in making the principle understood.
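The actual spilp.py code is not reproduced here, but the principle described above (a ref_fields object defining the desired order, and a line_spliter() that maps a file's detected order onto it) can be sketched roughly as follows; the names ref_fields and line_spliter come from the answer, everything else is an assumption:

```python
# ref_fields: the desired order in which values must be assigned.
ref_fields = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def make_line_spliter(detected_order):
    """Build a splitter that reads values laid out in detected_order
    and yields them rearranged into ref_fields order."""
    indices = [detected_order.index(name) for name in ref_fields]
    def line_spliter(line):
        values = line.split(" ")
        return [values[i] for i in indices]
    return line_spliter

# A file whose columns are shuffled relative to ref_fields:
spliter = make_line_spliter(["time", "date", "ip", "sitename", "status", "uri_stem"])
date, time_, sitename, ip, uri_stem, status = spliter(
    "09:39:00 2011-08-15 192.168.0.1 W3SVC1 200 /index.html")
# date == '2011-08-15', ip == '192.168.0.1'
```

The index mapping is computed once per file, so per-line work stays close to a plain split().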
Result:
This code applies only to the case where the fields in a file are shuffled but are the same in number as the normal, known list of fields.
Other cases may occur, for example fewer values in a file than there are known, expected fields. If you need more help with these other cases, explain which ones may happen and I'll try to adapt the code.
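One way the "fewer values than expected fields" case could be handled (an assumption, not part of the answer's code) is to pad missing trailing fields with None so downstream code can test for absence:

```python
from itertools import zip_longest

expected = ["date", "time", "sitename", "ip", "uri_stem", "status"]

def parse_partial(line):
    """Pair values with expected field names, padding missing trailing
    fields with None."""
    return dict(zip_longest(expected, line.split(" ")))

rec = parse_partial("2011-08-15 09:39:00 W3SVC1 192.168.0.1")
# rec["uri_stem"] and rec["status"] are None for this short line
```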
.
I think I will have many remarks to make on the code I quickly read in spilp.py. I'll write them when I have time.
Changing the log line layout is a pretty big deal, but something that gets done from time to time because new items are added or existing items deleted. Rarely does someone simply shuffle around existing items just for the heck of it.
These kinds of changes do not happen every day; they should be pretty rare. And when you have items within the log line added or deleted, you are changing the code anyway -- after all, the new fields have to be processed in some way, and the code that processes any deleted fields has to be removed.
Yes, writing resilient code is a Good Thing. Defining a schema mapping field names to their positions in the log line may seem like a great idea, since it permits reshuffling and adding without digging into the one split line. But is it worth it for schema changes that happen twice a year? And is it worth it to avoid changing one line when so many other lines will have to be changed anyway? That is for you to decide.
That said, if you want to do this, consider using collections.namedtuple to process your line into a dict-like object. The specification of the names can be done in a configuration-area of your code. You will take a performance hit in doing so, so weigh that against the gain in flexibility....
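A minimal sketch of the namedtuple approach suggested above (the field list here is illustrative, standing in for the configuration area):

```python
from collections import namedtuple

# Configuration area: edit this to match the log's field layout.
LOG_FIELDS = "date time sitename ip uri_stem status"

LogLine = namedtuple("LogLine", LOG_FIELDS)

def parse(line):
    """Split a log line and return a named, immutable record."""
    return LogLine(*line.split(" "))

rec = parse("2011-08-15 09:39:00 W3SVC1 192.168.0.1 /index.html 200")
# Access by name survives layout reshuffles: rec.ip, rec.status
```

If a true dict is needed later, `rec._asdict()` gives one; field access by attribute (rec.ip) is what insulates the rest of the code from layout changes.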