How can I use bash (grep/sed/etc) to grab the section of a logfile between 2 timestamps?
I have a set of mail logs: mail.log mail.log.0 mail.log.1.gz mail.log.2.gz
Each of these files contains chronologically sorted lines that begin with timestamps like:
May 3 13:21:12 ...
How can I easily grab every log entry after a certain date/time and before another date/time using bash (and related command line tools) without comparing every single line? Keep in mind that my before and after dates may not exactly match any entries in the logfiles.
It seems to me that I need to determine the offset of the first line greater than the starting timestamp, and the offset of the last line less than the ending timestamp, and cut that section out somehow.
Convert your min/max dates into "seconds since epoch",
Convert the first n words in each log line to the same,
Compare and throw away lines until you reach MIN,
Compare and print lines until you reach MAX,
Exit when you exceed MAX.

The whole script, minmaxlog.sh, looks like this; I ran it on this file, minmaxlog.input, like this.
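A minimal sketch of what such a script might look like (this is not the original minmaxlog.sh): it assumes GNU date for parsing the syslog-style "May 3 13:21:12" timestamps, lets the year default to the current one, and forks one date process per line, which is fine for a sketch but slow on large logs.

    #!/usr/bin/env bash
    # Usage (hypothetical): ./minmaxlog.sh "May 3 13:00:00" "May 3 14:00:00" < minmaxlog.input
    min=$(date -d "$1" +%s)    # lower bound as seconds since epoch
    max=$(date -d "$2" +%s)    # upper bound as seconds since epoch

    while IFS= read -r line; do
      read -r mon day time _ <<<"$line"     # first three words are the timestamp
      ts=$(date -d "$mon $day $time" +%s)
      (( ts < min )) && continue            # before MIN: throw the line away
      (( ts > max )) && break               # past MAX: stop reading
      printf '%s\n' "$line"                 # between MIN and MAX: print it
    done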
Here's one basic idea of how to do it:
What I don't know is: how best to read the nth line of a file (how efficient is it to use tail -n +N | head -1?)
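For what it's worth, that idiom is usually spelled as below; tail still has to read and discard the first N-1 lines to reach line N, so each lookup is linear in N rather than a seek (the file name and N here are just illustrative).

    N=1000
    tail -n +"$N" mail.log | head -n 1    # print only line N of mail.log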
Any help?
You have to look at every single line in the range you want (to tell if it's in the range you want) so I'm guessing you mean not every line in the file. At a bare minimum, you will have to look at every line in the file up to and including the first one outside your range (I'm assuming the lines are in date/time order).
This is a fairly simple pattern:
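Purely as an illustration (not the original answer's code), the pattern might look like this in gawk, assuming the syslog-style timestamps above, a hard-coded year (the log lines omit it), and MIN/MAX already converted to seconds since the epoch.

    gawk -v min="$MIN" -v max="$MAX" -v year=2009 '
    BEGIN {
      split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
      for (i = 1; i <= 12; i++) mon[m[i]] = i
    }
    {
      split($3, t, ":")                                   # HH:MM:SS
      ts = mktime(year " " mon[$1] " " $2 " " t[1] " " t[2] " " t[3])
      if (ts < min) next       # before the range: throw the line away
      if (ts > max) exit       # past the range: stop reading at once
      print                    # inside the range: keep it
    }' mail.log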
You can write this in awk, Perl, Python, even COBOL if you must but the logic is always the same.
Locating the line numbers first (with say grep) and then just blindly printing out that line range won't help since grep also has to look at all the lines (all of them, not just up to the first outside the range, and most likely twice, one for the first line and one for the last).
If this is something you're going to do quite often, you may want to consider shifting the effort from 'every time you do it' to 'once, when the file is stabilized'. An example would be to load up the log file lines into a database, indexed by the date/time.
That takes a while to get set up but will result in your queries becoming a lot faster. I'm not necessarily advocating a database - you could probably achieve the same effect by splitting the log files into hourly logs thus:
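For example, a rough way to do that split with gawk (the per-hour file layout mirrors the example below; the year is an assumption, since the log lines do not carry one):

    gawk -v year=2009 '
    BEGIN {
      split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
      for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
    }
    {
      dir = year "/" mon[$1] "/" sprintf("%02d", $2)
      if (!(dir in made)) { system("mkdir -p " dir); made[dir] = 1 }
      print > (dir "/" substr($3, 1, 2) "00.txt")   # e.g. 2009/05/03/1300.txt
    }' mail.log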
Then for a given time, you know exactly where to start and stop looking. The range 2009/01/01-15:22 through 2009/01/05-09:07 would result in:

some of (the last bit of) 2009/01/01/1500.txt
all of 2009/01/01/1[6-9]*.txt
all of 2009/01/01/2*.txt
all of 2009/01/0[2-4]/*.txt
all of 2009/01/05/0[0-8]*.txt
some of (the first bit of) 2009/01/05/0900.txt

Of course, I'd write a script to return those lines rather than trying to do it manually each time.
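A sketch of what such a script could look like, assuming GNU date and the YYYY/MM/DD/HHMM.txt layout above; it just prints the hourly file names covering the range, which you could then cat, trimming the first and last files with a line-by-line comparison as described earlier.

    #!/usr/bin/env bash
    # Usage (hypothetical): ./hourly-files.sh "2009/01/01 15:22" "2009/01/05 09:07"
    t=$(date -d "$1" +%s)
    stop=$(date -d "$2" +%s)
    t=$(( t - t % 3600 ))      # round the start down to the hour (whole-hour UTC offset assumed)
    while (( t <= stop )); do
      date -d "@$t" +%Y/%m/%d/%H00.txt
      t=$(( t + 3600 ))
    done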
Maybe you can try this:
It may be possible in a Bash environment, but you should really take advantage of tools that have more built-in support for working with strings and dates. For instance, Ruby seems to have the built-in ability to parse your date format. It can then convert it to an easily comparable Unix timestamp (a positive integer representing the seconds since the epoch).
You can then easily write a Ruby script:
Note: Converting to a Unix Timestamp integer first is nice because comparing integers is very easy and efficient to do.
You mentioned "without comparing every single line." It's going to be hard to "guess" at where in the log file the entries start being too old or too new without checking all the values in between. However, if there is indeed a monotonically increasing trend, then you know immediately when to stop parsing lines, because as soon as the next entry is too new (or too old, depending on the layout of the data) you know you can stop searching. Still, there is the problem of finding the first line in your desired range.
I just noticed your edit. Here is what I would say:
If you are really worried about efficiently finding that start and end entry, then you could do a binary search for each. Or, if that seems like overkill or too difficult with bash tools you could have a heuristic of reading only 5% of the lines (1 in every 20), to quickly get a close to exact answer and then refining that if desired. These are just some suggestions for performance improvements.
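If you do go the binary-search route, here is a rough sketch of the idea with standard shell tools; GNU date is assumed for parsing the "May 3 13:21:12" timestamps (the year defaults to the current one), and the script name and arguments are made up for illustration. Each probe still makes sed scan from the top of the file, so the I/O is not truly logarithmic, but only about log2(N) timestamp comparisons are needed.

    #!/usr/bin/env bash
    # Usage (hypothetical): ./find-start.sh mail.log "May 3 13:00:00"
    log=$1
    target=$(date -d "$2" +%s)

    lo=1
    hi=$(( $(wc -l < "$log") + 1 ))

    # classic lower-bound search: find the first line whose timestamp >= target
    while (( lo < hi )); do
      mid=$(( (lo + hi) / 2 ))
      line=$(sed -n "${mid}{p;q}" "$log")
      ts=$(date -d "$(awk '{print $1, $2, $3}' <<<"$line")" +%s)
      if (( ts < target )); then
        lo=$(( mid + 1 ))
      else
        hi=$mid
      fi
    done

    echo "first line at or after the target time is line $lo"
    # e.g. print from there to the end:  tail -n +"$lo" "$log"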
I know this thread is old, but I just stumbled upon it after recently finding a one line solution for my needs:
In this case, my file has records with comma-separated values and the timestamp in the first field. You can use any valid timestamp format for the start and end timestamps, and replace these with shell variables if desired.
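One guess at the shape of such a one-liner, assuming the timestamps sort lexically (e.g. ISO 8601) so plain string comparison works, and a hypothetical mail.csv as the input file:

    awk -F, -v start="2009-01-01 15:22:00" -v end="2009-01-05 09:07:00" \
        '$1 >= start && $1 <= end' mail.csv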
If you want to write to a new file, just use normal output redirection (> newfile) appended to the end of the above.