Using a regular expression to match everything between two specific words
I'm attempting to parse an Oracle trace file using regular expressions. My language of choice is C#, but I chose to use Ruby for this exercise to get some familiarity with it.
The log file is somewhat predictable. Most lines (99.8%, to be specific) match the following pattern:
# [Timestamp] [Thread] [Event] [Message]
# TIME:2010/08/25-12:00:01:945 TID: a2c (VERSION) Managed Assembly version: 2.102.2.20
# TIME:2010/08/25-14:00:02:398 TID:1a60 OpsSqlPrepare2(): SELECT * FROM MyTable
line_regex = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/
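As a quick check of the pattern above, here is a minimal Ruby sketch that applies it to one of the sample lines (without the leading `# ` comment marker). The method name `parse_line` and the hash keys are illustrative choices, not something taken from the trace format itself.

```ruby
# Pattern from the question: timestamp, thread id, event, message.
LINE_REGEX = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/

# Returns a hash of the four captured fields, or nil when the line
# does not match the single-line pattern.
def parse_line(line)
  m = LINE_REGEX.match(line)
  return nil unless m
  { timestamp: m[1], thread: m[2], event: m[3], message: m[4] }
end

sample = "TIME:2010/08/25-14:00:02:398 TID:1a60 OpsSqlPrepare2(): SELECT * FROM MyTable"
p parse_line(sample)
```

A line that returns `nil` here is a candidate for the multi-line handling discussed below.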
However, in a few places in the log there are much more complicated queries that, for some reason, span several lines:
Two things to point out about these entries: they appear to cause some sort of corruption in the log file, because they end with unprintable characters, and then the next entry suddenly begins on the same line.
Since this obviously rules out capturing data on a per-line basis, I think the next best option is to match everything between the word "TIME:" and either the next instance of "TIME:" or the end of the file. I'm not sure how to express this using regular expressions.
Is there a more efficient approach? The log file I need to parse will be over 1.5GB. My intention is to normalize the lines, and drop unnecessary lines, to eventually insert them as rows in a database for querying.
Thanks!
Comments (2)
The regex to match potentially multi-line data between "TIME:" and the next "TIME:" string or the end of the file is:
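As an illustration only, and not necessarily the exact expression this answer had in mind, one pattern with that behaviour in Ruby combines a lazy match with a lookahead. The `m` flag lets `.` cross line breaks, and `\z` stands for the end of the string. The sample text and the constant name `ENTRY_REGEX` are made up for the sketch.

```ruby
# Lazy match from one "TIME:" up to (but not including) the next "TIME:"
# or the end of the string. The /m flag makes "." match newlines too.
ENTRY_REGEX = /TIME:(.*?)(?=TIME:|\z)/m

# Invented sample: the second entry spans two lines, ends with
# unprintable bytes, and the third entry starts on the same line.
text = "TIME:a TID:1 X first\n" \
       "TIME:b TID:2 Y select *\n from t\x00\x01TIME:c TID:3 Z last"

entries = text.scan(ENTRY_REGEX).flatten
entries.each { |entry| p entry }
```

Note that `scan` needs the whole text in memory as a single string, so on a 1.5GB file this is more useful for testing the pattern on excerpts than for the full parse. It also assumes the literal text "TIME:" never occurs inside a message.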
On the other hand, as James mentions, tokenizing on "TIME:" substrings, or looking for substring positions of "\r\nTIME:" (after the first "TIME:" entry, depending on line-break format), may prove a better approach.
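One way to sketch the tokenizing idea in Ruby is to pass "TIME:" as the record separator to `each_line`, so the file is read entry by entry rather than line by line. The helper name `each_entry` and the sample data are hypothetical, and the sketch assumes "TIME:" never appears inside a message.

```ruby
require "stringio"

# Yields the text of each entry (everything after one "TIME:" marker and
# before the next), reading from any IO-like object chunk by chunk.
def each_entry(io, marker = "TIME:")
  io.each_line(marker) do |chunk|
    body = chunk.chomp(marker)    # drop the trailing marker, if present
    yield body unless body.empty? # the chunk before the first marker is empty
  end
end

# Invented sample standing in for the real trace file.
log = StringIO.new("TIME:a first\nTIME:b select *\n from t\x00TIME:c last\n")
each_entry(log) { |entry| p entry }
```

For a real file, something like `File.open(path, "rb") { |f| each_entry(f) { |e| ... } }` would keep memory use proportional to the largest entry, and binary mode sidesteps encoding errors from the unprintable bytes.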
It might be better to do this old-school, i.e. read your file in one line at a time... start at the first 'TIME', and concatenate your lines until you hit the next 'TIME'... you can use regular expressions to filter out any lines you don't want.
I can't speak to Ruby; in C# it would be a StreamReader, of course, which helps you deal with the file size.
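A rough Ruby equivalent of this line-at-a-time approach might look like the sketch below. The method name `each_logical_entry` is made up, and splitting each line at embedded "TIME:" markers is an added assumption to cope with the corrupted entries where the next record starts mid-line.

```ruby
require "stringio"

# Groups physical lines into logical entries. A new entry starts wherever
# "TIME:" appears, including in the middle of a line.
def each_logical_entry(io)
  buffer = nil
  io.each_line do |line|
    # The lookahead splits before each "TIME:" without consuming it.
    line.split(/(?=TIME:)/).each do |piece|
      if piece.start_with?("TIME:")
        yield buffer if buffer   # flush the previous entry
        buffer = piece.dup
      elsif buffer
        buffer << piece          # continuation of the current entry
      end                        # text before the first TIME: is ignored
    end
  end
  yield buffer if buffer         # the last entry runs to end of file
end

# Invented sample standing in for the real trace file.
log = StringIO.new("TIME:a first\nTIME:b select *\n from t\x00TIME:c last\n")
each_logical_entry(log) { |entry| p entry }
```

Each yielded entry can then be matched against the single-line pattern, or filtered out, before being inserted into the database.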