Java: Read the last n lines of a huge file
I want to read the last n lines of a very big file using Java, without reading the whole file into any buffer or memory area.
I looked through the JDK APIs and Apache Commons I/O but could not find anything suitable for this purpose.
I was thinking of the way tail or less does it in UNIX. I don't think they load the entire file and then show its last few lines. There should be a similar way to do the same in Java.
Comments (15)
I found the simplest way is to use ReversedLinesFileReader from the Apache Commons IO API. It gives you the lines of a file from bottom to top, and you can use an n_lines value to specify how many lines to read.
If you use a RandomAccessFile, you can use length and seek to get to a specific point near the end of the file and then read forward from there.
If you find there weren't enough lines, back up from that point and try again. Once you've figured out where the Nth-last line begins, you can seek there and just read and print.
An initial best guess can be based on the properties of your data. For example, if it's a text file, the lines may average no more than 132 characters, so to get the last five lines, start 660 characters before the end. Then, if you were wrong, try again at 1320. You can even use what you learned from the last 660 characters to adjust: if those 660 characters held just three lines, the next try could be 660 / 3 * 5, plus a bit extra just in case.
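The guess-and-retry idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name GuessingTail, the method lastLines and the guessedLineLength parameter are made up for the example, the file is assumed to be UTF-8 with LF (or CRLF) line endings, and the window is simply doubled on each retry rather than adjusted from what the previous read revealed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class GuessingTail {
    /**
     * Returns up to the last n lines of the file. Hypothetical helper:
     * assumes UTF-8 text and that the tail fits into a single byte array.
     */
    static List<String> lastLines(Path file, int n, int guessedLineLength) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            // Initial guess: n lines of the guessed average length.
            long window = Math.max(1L, (long) n * guessedLineLength);
            while (true) {
                long start = Math.max(0L, length - window);
                if (length - start > Integer.MAX_VALUE - 8) {
                    throw new IOException("Tail does not fit into one byte array");
                }
                byte[] buf = new byte[(int) (length - start)];
                raf.seek(start);
                raf.readFully(buf);

                // Unless we started at offset 0, the first line in the window is
                // probably cut off, so skip everything up to the first LF.
                int from = 0;
                if (start > 0) {
                    while (from < buf.length && buf[from] != '\n') {
                        from++;
                    }
                    from = Math.min(from + 1, buf.length);
                }
                List<String> lines = new String(buf, from, buf.length - from, StandardCharsets.UTF_8)
                        .lines()
                        .collect(Collectors.toList());

                // Enough complete lines, or the whole file has been read: done.
                if (lines.size() >= n || start == 0) {
                    int first = Math.max(0, lines.size() - n);
                    return new ArrayList<>(lines.subList(first, lines.size()));
                }
                window *= 2; // The guess was too small: back up further and retry.
            }
        }
    }
}
```

A call such as GuessingTail.lastLines(path, 5, 132) would start 660 bytes before the end, as in the example above. Cutting at an LF byte is safe in UTF-8 because that byte value never occurs inside a multi-byte character.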
RandomAccessFile is a good place to start, as described by the other answers. There is one important caveat though.
If your file is not encoded with a one-byte-per-character encoding, the readLine() method is not going to work for you. And readUTF() won't work in any circumstances. (It reads a string preceded by a two-byte length prefix, in a modified UTF-8 format.)
Instead, you will need to make sure that you look for end-of-line markers in a way that respects the encoding's character boundaries. For fixed-length encodings (e.g. flavors of UTF-16 or UTF-32) you need to extract characters starting from byte positions that are divisible by the character size in bytes. For variable-length encodings (e.g. UTF-8), you need to search for a byte that must be the first byte of a character.
In the case of UTF-8, the first byte of a character will be 0xxxxxxx or 110xxxxx or 1110xxxx or 11110xxx. Anything else is either a continuation byte (second, third or fourth) or an illegal UTF-8 sequence. See The Unicode Standard, Version 5.2, Chapter 3.9, Table 3-7. This means, as the comment discussion points out, that any 0x0A and 0x0D bytes in a properly encoded UTF-8 stream will represent an LF or CR character. Thus, simply counting the 0x0A and 0x0D bytes is a valid implementation strategy (for UTF-8) if we can assume that the other kinds of Unicode line separator (U+2028, U+2029 and U+0085) are not used. If you can't assume that, the code will be more complicated.
Having identified a proper character boundary, you can then just call new String(...), passing the byte array, offset, count and encoding, and then repeatedly call String.lastIndexOf(...) to count end-of-lines.
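As a rough illustration of the byte-counting strategy for UTF-8, the sketch below scans backward in chunks, counts only 0x0A (LF) bytes, and decodes once the start of the Nth-last line is known. The names Utf8Tail and tail are invented for the example, and it assumes LF or CRLF line endings, so a lone CR or the other Unicode separators mentioned above are not treated as line breaks.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class Utf8Tail {
    /** Returns the last n lines as one string (hypothetical helper, UTF-8 only). */
    static String tail(Path file, int n) throws IOException {
        if (n <= 0) {
            return "";
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            long end = length;
            // An LF as the very last byte terminates the final line; it does
            // not start a new one, so leave it out of the count.
            if (length > 0) {
                raf.seek(length - 1);
                if (raf.read() == 0x0A) {
                    end = length - 1;
                }
            }

            byte[] chunk = new byte[8192];
            long pos = end;   // scan position, moving toward the start of the file
            long start = 0;   // offset of the first byte of the tail
            int found = 0;    // LF bytes seen so far
            scan:
            while (pos > 0) {
                int size = (int) Math.min(chunk.length, pos);
                pos -= size;
                raf.seek(pos);
                raf.readFully(chunk, 0, size);
                for (int i = size - 1; i >= 0; i--) {
                    // In valid UTF-8 a 0x0A byte is always a real LF character.
                    if (chunk[i] == 0x0A && ++found == n) {
                        start = pos + i + 1;
                        break scan;
                    }
                }
            }

            // Assumes the tail itself fits into a single byte array.
            byte[] tailBytes = new byte[(int) (length - start)];
            raf.seek(start);
            raf.readFully(tailBytes);
            return new String(tailBytes, StandardCharsets.UTF_8);
        }
    }
}
```

Because the tail always starts right after an LF byte (or at offset 0), decoding begins on a character boundary even when the text contains multi-byte characters.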
The ReversedLinesFileReader can be found in the Apache Commons IO Java library.
I found RandomAccessFile and the other buffered reader classes too slow for me. Nothing can be faster than a tail -<#lines>, so this was the best solution for me.
Here's one without an Apache dependency. It uses Java streams and is much faster than RandomAccessFile or Apache's ReversedLinesFileReader. Here are the results I got when reading the last 90,000 lines from a 100,000-line file:
This method: 50 ms
Apache's ReversedLinesFileReader: 900 ms
RandomAccessFile (reading in reverse): 1,200 ms
Original source
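The code for this answer is not reproduced here, so the following is only a guess at what a streams-based version without Apache might look like, not the poster's actual method. It makes two sequential passes with Files.lines (one to count, one to skip), so it never holds the whole file in memory but does read all of it. The names StreamTail and lastLines are made up, and UTF-8 (the Files.lines default) is assumed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StreamTail {
    /** Returns up to the last n lines (hypothetical helper, two passes over the file). */
    static List<String> lastLines(Path file, int n) throws IOException {
        long total;
        // First pass: count the lines lazily, one at a time.
        try (Stream<String> lines = Files.lines(file)) {
            total = lines.count();
        }
        // Second pass: skip everything before the last n lines.
        try (Stream<String> lines = Files.lines(file)) {
            return lines.skip(Math.max(0L, total - n)).collect(Collectors.toList());
        }
    }
}
```

An approach like this tends to suit cases such as the one measured above, where most of the file is wanted anyway. For a few lines from a very large file, seeking near the end avoids reading the rest.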
CircularFifoBuffer from Apache Commons. This is the answer to a similar question, How to read last 5 lines of a .txt file into java.
Note that in Apache Commons Collections 4 this class seems to have been renamed to CircularFifoQueue.
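The circular-buffer idea can be tried without the Apache dependency. The sketch below swaps in java.util.ArrayDeque as a stand-in for CircularFifoBuffer / CircularFifoQueue, dropping the oldest line by hand once n lines are held. RingBufferTail and lastLines are names invented for the example, and UTF-8 is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RingBufferTail {
    /** Returns up to the last n lines, keeping at most n lines in memory. */
    static List<String> lastLines(Path file, int n) throws IOException {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Deque<String> window = new ArrayDeque<>(n);
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (window.size() == n) {
                    window.removeFirst(); // evict the oldest line, like a circular buffer
                }
                window.addLast(line);
            }
        }
        return new ArrayList<>(window);
    }
}
```

This reads the file from start to finish, so memory use stays bounded by n lines, but the run time still grows with the size of the whole file.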
A RandomAccessFile allows for seeking (http://download.oracle.com/javase/1.4.2/docs/api/java/io/RandomAccessFile.html). The File.length method will return the size of the file. The problem is determining the number of lines. For this, you can seek to the end of the file and read backwards until you have hit the right number of lines.
I had a similar problem, but I didn't understand the other solutions.
I used this. I hope it's simple code.
Here is working code for this.
Here is the best way I've found to do it. It is simple, pretty fast, and memory efficient.
(See comments)
I tried RandomAccessFile first, and it was tedious to read the file backwards, repositioning the file pointer on every read operation. So I tried @Luca's solution and, within a few minutes, had the last few lines of the file as a string in just two lines of code.
Code is 2 lines only
Gradle:
Maven: