Get the last n lines of a file, similar to tail
I'm writing a log file viewer for a web application and I want to paginate through the lines of the log file. The items in the file are line based with the newest item at the bottom.
So I need a tail() method that can read n lines from the bottom and support an offset. This is what I came up with:
def tail(f, n, offset=0):
    """Reads n lines from f with an offset of offset lines."""
    avg_line_length = 74
    to_read = n + offset
    while 1:
        try:
            f.seek(-(avg_line_length * to_read), 2)
        except IOError:
            # woops. apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None]
        avg_line_length *= 1.3
Is this a reasonable approach? What is the recommended way to tail log files with offsets?
This may be quicker than yours. Makes no assumptions about line length. Backs through the file one block at a time till it's found the right number of '\n' characters.
I don't like tricky assumptions about line length when -- as a practical matter -- you can never know things like that.
Generally, this will locate the last 20 lines on the first or second pass through the loop. If your 74 character thing is actually accurate, you make the block size 2048 and you'll tail 20 lines almost immediately.
Also, I don't burn a lot of brain calories trying to finesse alignment with physical OS blocks. Using these high-level I/O packages, I doubt you'll see any performance consequence of trying to align on OS block boundaries. If you use lower-level I/O, then you might see a speedup.
UPDATE

For Python 3.2 and up, follow this process on bytes, since in text files (those opened without a "b" in the mode string) only seeks relative to the beginning of the file are allowed (the exception being a seek to the very end of the file with seek(0, 2)):

e.g.:
f = open('C:/.../../apache_logs.txt', 'rb')
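A minimal sketch of the block-stepping approach described above, assuming the file object is opened in 'rb' mode (the function name and the 1024-byte block size are illustrative choices, not the answer's exact code):

```python
def tail(f, lines=20, block_size=1024):
    # step backwards through the file one block at a time until enough
    # b'\n' characters have been seen; no assumption about line length
    f.seek(0, 2)                       # jump to the end of the file
    remaining = f.tell()
    blocks = []
    newlines = 0
    while remaining > 0 and newlines <= lines:
        read_size = min(block_size, remaining)
        f.seek(remaining - read_size)
        block = f.read(read_size)
        blocks.append(block)           # blocks are collected back-to-front
        newlines += block.count(b'\n')
        remaining -= read_size
    text = b''.join(reversed(blocks))  # restore front-to-back order
    return text.splitlines()[-lines:]
```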
Assumes a unix-like system on Python 2 you can do:
For python 3 you may do:
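A sketch of the idea, shelling out to the Unix tail command (assumes tail is on PATH; the function name is illustrative):

```python
import subprocess

def tail(path, n=10):
    # shell out to the Unix tail command; on Python 2.6 use
    # subprocess.Popen(...).communicate() instead, since
    # subprocess.check_output was only added in 2.7
    out = subprocess.check_output(['tail', '-n', str(n), path])
    return out.decode().splitlines()
```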
Here is my answer. Pure python. Using timeit it seems pretty fast. Tailing 100 lines of a log file that has 100,000 lines:
Here is the code:
If reading the whole file is acceptable then use a deque.
Prior to 2.6, deques didn't have a maxlen option, but it's easy enough to implement.
If it's a requirement to read the file from the end, then use a gallop (a.k.a exponential) search.
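A sketch of the deque variant (hypothetical helper name): the file is read straight through once, but at most n lines are held in memory at any moment.

```python
from collections import deque

def tail(filename, n=10):
    # reads the file front to back, but the bounded deque keeps only the
    # last n lines seen; before 2.6 the same bound is easy to keep by hand
    with open(filename) as f:
        return list(deque(f, maxlen=n))
```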
S.Lott's answer above almost works for me but ends up giving me partial lines. It turns out that it corrupts data on block boundaries because data holds the read blocks in reversed order. When ''.join(data) is called, the blocks are in the wrong order. This fixes that.
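A toy demonstration of the boundary bug being described: blocks read from the end of the file backwards must be reversed before joining, or lines that straddle a block boundary come out scrambled.

```python
# blocks as collected while stepping backwards: the end of the file first
blocks = [b'last block\n', b'middle block\n', b'first block\n']

wrong = b''.join(blocks)            # joins the blocks in the wrong order
right = b''.join(reversed(blocks))  # restores the original file order
assert right == b'first block\nmiddle block\nlast block\n'
```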
The code I ended up using. I think this is the best so far:
Simple and fast solution with mmap:
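A sketch of an mmap-based tail (the function name and use of rfind are my own choices): the file is memory-mapped and rfind walks backwards to the start of the n-th line from the end.

```python
import mmap

def tail(filename, n=10):
    # memory-map the file read-only and locate the n-th newline from the
    # end with rfind, without reading the whole file into memory
    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            pos = len(mm)
            # skip the newline that terminates the file, if any
            if pos and mm[pos - 1:pos] == b'\n':
                pos -= 1
            for _ in range(n):
                pos = mm.rfind(b'\n', 0, pos)
                if pos < 0:          # fewer than n lines: take the whole file
                    break
            # pos is the newline just before the wanted lines (or -1)
            return mm[pos + 1:].splitlines()
        finally:
            mm.close()
```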
Update of the @papercrane solution to Python 3. Open the file with open(filename, 'rb') and:

The simplest way is to use a deque:
Posting an answer at the behest of commenters on my answer to a similar question, where the same technique was used to mutate the last line of a file, not just get it.

For a file of significant size, mmap is the best way to do this. To improve on the existing mmap answer, this version is portable between Windows and Linux, and should run faster (though it won't work without some modifications on 32-bit Python with files in the GB range; see the other answer for hints on handling this, and for modifying it to work on Python 2).

This assumes the number of lines tailed is small enough that you can safely read them all into memory at once; you could also make this a generator function and manually read a line at a time by replacing the final line with:

Lastly, this reads in binary mode (necessary to use mmap), so it gives str lines (Py2) and bytes lines (Py3); if you want unicode (Py2) or str (Py3), the iterative approach could be tweaked to decode for you and/or fix newlines.

Note: I typed this all up on a machine where I lack access to Python to test. Please let me know if I typoed anything; this was similar enough to my other answer that I think it should work, but the tweaks (e.g. handling an offset) could lead to subtle errors. Please let me know in the comments if there are any mistakes.
An even cleaner Python 3 compatible version that doesn't insert but appends and reverses:
use it like this:
Simple:
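A sketch of the simplest possible version (reads the whole file, so only suitable for files that fit in memory; the function name is illustrative):

```python
def tail(path, n=10):
    # read everything and keep only the last n lines
    with open(path) as f:
        return f.readlines()[-n:]
```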
Based on S.Lott's top-voted answer (Sep 25 '08 at 21:43), but fixed for small files.
Hope this is useful.
There are some existing implementations of tail on pypi which you can install using pip:
Depending on your situation, there may be advantages to using one of these existing tools.
There is a very useful module that can do this:
I found the Popen above to be the best solution. It's quick and dirty and it works.

For Python 2.6 on a Unix machine I used the following:

soutput will contain the last n lines of the file. To iterate through soutput line by line, do:
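A sketch of the Popen pattern being described (the file name and line count are illustrative):

```python
import subprocess

# build a small example log to tail (illustrative file name)
with open('error.log', 'w') as w:
    for i in range(100):
        w.write('event %d\n' % i)

# quick and dirty: shell out to the Unix tail command
p = subprocess.Popen(['tail', '-n', '10', 'error.log'],
                     stdout=subprocess.PIPE)
soutput, sinput = p.communicate()

# soutput now holds the last 10 lines; iterate through it line by line
for line in soutput.decode().splitlines():
    print(line)
```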
For efficiency with very large files (common in logfile situations where you may want to use tail), you generally want to avoid reading the whole file (even if you do it without reading the whole file into memory at once). However, you do need to somehow work out the offset in lines rather than characters. One possibility is reading backwards with seek() char by char, but this is very slow. Instead, it's better to process in larger blocks.
I have a utility function I wrote a while ago to read files backwards that can be used here.
[Edit] Added more specific version (avoids need to reverse twice)
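A sketch of such a backwards-reading utility (names are my own): a generator that yields lines last-to-first, reading the file in blocks from the end and carrying partial lines across block boundaries.

```python
def reversed_lines(path, block_size=4096):
    # yield the lines of a file last-to-first, reading backwards in blocks
    # so the whole file never has to be loaded at once
    with open(path, 'rb') as f:
        f.seek(0, 2)
        pos = f.tell()
        fragment = b''                  # partial line carried between blocks
        while pos > 0:
            read_size = min(block_size, pos)
            pos -= read_size
            f.seek(pos)
            block = f.read(read_size) + fragment
            pieces = block.split(b'\n')
            fragment = pieces[0]        # may be a partial line; defer it
            for piece in reversed(pieces[1:]):
                yield piece
        yield fragment                  # the very first line of the file
```

To tail, take the first n items of the generator and reverse them; this avoids reversing the whole file twice.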
You can go to the end of your file with f.seek(0, 2) and then read off lines one by one with the following replacement for readline():
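A sketch of what such a backwards readline() replacement might look like (illustrative, not the answer's exact code; assumes a file opened in 'rb' mode and positioned at the end of a line):

```python
def readline_backwards(f):
    # read one line ending at the current position, moving backwards;
    # leaves the file position at the start of the returned line
    line = b''
    while f.tell() > 0:
        f.seek(-1, 1)            # step back one byte
        ch = f.read(1)           # read it (this moves forward again)
        f.seek(-1, 1)            # step back over it once more
        if ch == b'\n' and line:
            f.seek(1, 1)         # stop just after the previous newline
            break
        line = ch + line
    return line
```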
Based on Eyecue's answer (Jun 10 '10 at 21:28): this class adds head() and tail() methods to a file object.
Usage:
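A sketch of the idea (the original extended the file type itself; this version wraps an already-open text file, and the class and method names are illustrative):

```python
import itertools
from collections import deque

class File:
    # wraps an open text file and adds head() and tail() methods
    def __init__(self, f):
        self._f = f

    def head(self, lines=10):
        self._f.seek(0)
        return [l.rstrip('\n') for l in itertools.islice(self._f, lines)]

    def tail(self, lines=10):
        self._f.seek(0)
        return [l.rstrip('\n') for l in deque(self._f, maxlen=lines)]
```

Usage might look like f = File(open('errors.log')) followed by f.head(5) and f.tail(5) (the file name is hypothetical).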
Several of these solutions have issues if the file doesn't end in \n or in ensuring the complete first line is read.
I had to read a specific value from the last line of a file, and stumbled upon this thread. Rather than reinventing the wheel in Python, I ended up with a tiny shell script, saved as
/usr/local/bin/get_last_netp:
And in the Python program:
Not the first example using a deque, but a simpler one. This one is general: it works on any iterable object, not just a file.
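A sketch of that generalization (hypothetical name): because deque accepts any iterable, the same one-liner works on files, lists, and generators alike.

```python
from collections import deque

def tail(iterable, n=10):
    # return the last n items of any iterable, not just a file
    return list(deque(iterable, maxlen=n))
```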
Here is a pretty simple implementation:
Update for the answer given by A.Coady.

Works with Python 3.

This uses exponential search and will buffer only N lines from the back, and is very efficient.
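A sketch of the exponential-search idea (names and the 1 KiB starting window are illustrative): read a window from the end of the file and keep doubling it until it contains more than n newlines, so only the tail of the file is ever buffered.

```python
def tail(path, n=10):
    # galloping search from the end: read 1 KiB, then 2, then 4, ...
    # until the window spans more than n newlines (or the whole file)
    with open(path, 'rb') as f:
        f.seek(0, 2)
        size = f.tell()
        block = 1024
        while True:
            start = max(0, size - block)
            f.seek(start)
            data = f.read(size - start)
            if data.count(b'\n') > n or start == 0:
                return [line.decode() for line in data.splitlines()[-n:]]
            block *= 2            # double the window and try again
```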
Another Solution
If your txt file looks like this:
mouse
snake
cat
lizard
wolf
dog
You could reverse this file by simply using array indexing in Python:
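A sketch of that slice-based reversal (the file name is illustrative): read all the lines, then reverse them with [::-1].

```python
def reverse_file(path):
    # read all lines, then reverse them with a [::-1] slice
    with open(path) as f:
        return f.read().splitlines()[::-1]
```

Printing each item of reverse_file('animals.txt') gives the lines in reverse order.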
result:
dog
wolf
lizard
cat
snake
mouse
Well! I had a similar problem, though I only required the last line, so I came up with my own solution.

This function returns the last string in a file.

I have a log file of 1.27 GB and it took very little time to find the last line (not even half a second).
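A sketch of a last-line-only function (name and window sizes are my own choices): seek close to the end and step further back until a newline is found, so even a multi-gigabyte log has only its final bytes read.

```python
def last_line(path):
    # grow a window backwards from the end of the file until it contains
    # a newline (or covers the whole file), then return what follows it
    with open(path, 'rb') as f:
        f.seek(0, 2)
        size = f.tell()
        back = 1024
        while True:
            f.seek(max(0, size - back))
            data = f.read().rstrip(b'\n')   # ignore the newline(s) at EOF
            if b'\n' in data or size - back <= 0:
                return data.rpartition(b'\n')[2].decode()
            back *= 2
```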