Using a regular expression to match everything between two specific words

Posted 2024-09-16 06:43:54


I'm attempting to parse an Oracle trace file using regular expressions. My language of choice is C#, but I chose to use Ruby for this exercise to get some familiarity with it.

The log file is somewhat predictable. Most lines (99.8%, to be specific) match the following pattern:

# [Timestamp]                  [Thread]  [Event]   [Message]
# TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20
# TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT * FROM MyTable
line_regex = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/
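A minimal sketch of how this pattern splits a well-formed line into its four fields. The `parse_line` helper and the hash keys are hypothetical names introduced only for illustration; the sample line is the second one from the listing above.

```ruby
# Same pattern as above, stored in a constant so the helper can see it.
LINE_REGEX = /^TIME:(\S+)\s+TID:\s*(\S+)\s+(\S+)\s+(.*)$/

# Hypothetical helper: returns a hash of fields, or nil when the line
# does not follow the common single-line layout.
def parse_line(line)
  m = LINE_REGEX.match(line)
  return nil unless m

  { timestamp: m[1], thread: m[2], event: m[3], message: m[4] }
end

sample = "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT * FROM MyTable"
fields = parse_line(sample)
p fields
```

A line that does not start with `TIME:` (for example, the continuation of a multi-line query) makes `parse_line` return `nil`, which is the case the rest of this question is about.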

However, in a few places in the log there are much more complicated queries that, for some reason, span several lines:

Screenshot

Two things to point out about these entries: they appear to cause some sort of corruption in the log file, because they end with unprintable characters, and the next entry then suddenly begins on the same line.

Since this obviously rules out capturing data on a per-line basis, I think the next best option is to match everything between the word "TIME:" and either the next instance of "TIME:" or the end of the file. I'm not sure how to express this using regular expressions.

Is there a more efficient approach? The log file I need to parse will be over 1.5GB. My intention is to normalize the lines, and drop unnecessary lines, to eventually insert them as rows in a database for querying.

Thanks!


镜花水月 2024-09-23 06:43:54


The regex to match potentially multi-line data between "TIME:" and the next "TIME:" string, or the end of the file, is:

/^TIME:(.+?)(?=TIME:|\z)/im
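A minimal sketch of how this pattern could be applied with `String#scan`. The sample log is made up for illustration: its second entry spans two lines and ends with two unprintable characters, and the third entry starts on that same line, as described in the question. In Ruby, the `m` flag lets `.` match newlines, while `^` always matches at the start of a line. One consequence is that an entry beginning mid-line is not matched on its own while the `^` anchor is present, so the sketch also tries a variant without it (`UNANCHORED`, a hypothetical name).

```ruby
# The pattern from the answer, and a variant without the line-start anchor.
ANCHORED   = /^TIME:(.+?)(?=TIME:|\z)/im
UNANCHORED = /TIME:(.+?)(?=TIME:|\z)/im

# Made-up sample: entry 2 spans two lines and ends in unprintable bytes,
# and entry 3 starts on the same line.
log = "TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20\n" \
      "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT *\n" \
      "  FROM MyTable\x00\x01TIME:2010/08/25-14:00:03:000 TID:1a60  (EXIT) done\n"

# scan returns one single-element array per match (one capture group).
anchored   = log.scan(ANCHORED).map(&:first)
unanchored = log.scan(UNANCHORED).map(&:first)

puts "with ^:    #{anchored.size} entries"
puts "without ^: #{unanchored.size} entries"
```

Without the anchor, any literal `TIME:` inside a message would also start a new entry, so which variant fits depends on the data. Note also that `scan` needs the whole input in memory as one string, which may not be practical for a file over 1.5GB.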

On the other hand, as James mentions, tokenizing on "TIME:" substrings, or looking for the substring positions of "\r\nTIME:" (after the first "TIME:" entry, depending on line-break format), may prove a better approach.
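One way the tokenizing idea might look in Ruby is to pass `"TIME:"` as the record separator to `each_line`, so the file is read one entry at a time instead of being loaded whole. This is only a sketch under assumptions: `each_entry` is a hypothetical helper, the sample data is made up, and replacing control characters with spaces is just one possible way to normalize an entry.

```ruby
require "stringio"

# Hypothetical helper: yields each entry's text without the "TIME:" marker.
# Works on any IO-like object, so only one entry is held in memory at a time.
def each_entry(io)
  io.each_line("TIME:") do |chunk|
    entry = chunk.chomp("TIME:")            # drop the trailing separator
                 .gsub(/[[:cntrl:]]+/, " ") # newlines and unprintables -> space
                 .squeeze(" ")
                 .strip
    yield entry unless entry.empty?         # text before the first marker is empty
  end
end

# Made-up sample standing in for the real trace file.
log = "TIME:2010/08/25-12:00:01:945 TID: a2c  (VERSION) Managed Assembly version: 2.102.2.20\n" \
      "TIME:2010/08/25-14:00:02:398 TID:1a60  OpsSqlPrepare2(): SELECT *\n" \
      "  FROM MyTable\x00\x01TIME:2010/08/25-14:00:03:000 TID:1a60  (EXIT) done\n"

entries = []
each_entry(StringIO.new(log)) { |entry| entries << entry }
entries.each { |entry| puts entry }

# For a real file (assumed path), binary mode avoids encoding errors on
# the unprintable bytes:
#   File.open("trace.log", "rb") { |f| each_entry(f) { |e| puts e } }
```

Each yielded entry starts with the timestamp, so it can be matched against a single-line pattern (minus the leading `TIME:`) before being inserted into a database. As with the unanchored regex, a message that itself contains the text `TIME:` would be split in two.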
