Reading a file with a Java Scanner
One of the lines in a Java file I'm trying to understand is shown below.
return new Scanner(file).useDelimiter("\\Z").next();
According to the java.util.regex.Pattern documentation, \Z matches "The end of the input but for the final terminator, if any", so this line is expected to return the whole file up to that point. What actually happens is that it returns only the first 1024 characters of the file. Is this a limitation imposed by the regex Pattern matcher? Can it be overcome? For now I'm going ahead with a FileReader instead, but I would like to know the reason for this behaviour.
Comments (4)
I couldn't reproduce this myself, but I think I can shed some light on what is going on.
Internally, the Scanner uses a character buffer of 1024 characters. By default the Scanner reads 1024 characters from your Readable, if that many are available, and then applies the pattern.
The problem is in your pattern: it always matches the end of the input, but that doesn't necessarily mean the end of your input stream or data. When Java applies your pattern to the buffered data, it looks for the first occurrence of the end of input. Since there are 1024 characters in the buffer, the matching engine treats position 1024 as the first match of the delimiter, and everything before it is returned as the first token.
For that reason, I don't think the end-of-input anchor is valid for use in the Scanner. It could be reading from an infinite stream, after all.
Try wrapping the file object in a FileInputStream.
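A minimal sketch of what that suggestion could look like. The class and method names (StreamScannerExample, readAll) are hypothetical, and whether this actually avoids the 1024-character cut-off is not confirmed anywhere in this thread; it only shows the wrapping itself.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class StreamScannerExample {
    // Opens the file as a FileInputStream and hands that stream to the Scanner,
    // instead of passing the File object directly.
    static String readAll(File file) throws IOException {
        try (InputStream in = new FileInputStream(file);
             Scanner scanner = new Scanner(in)) { // decodes with the platform default charset
            scanner.useDelimiter("\\Z");
            // hasNext() is false for an empty file; next() would throw in that case.
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```

Note that the Scanner(InputStream) constructor still decodes bytes with the platform default charset, so this changes only how the file is opened, not how it is decoded.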
Scanner is intended to read multiple primitives from a file. It really isn't intended to read an entire file.
If you don't want to include third-party libraries, you're better off looping over a BufferedReader that wraps a FileReader/InputStreamReader for text, or looping over a FileInputStream for binary data.
If you're OK using a third-party library, Apache commons-io has a FileUtils class that contains the static methods readFileToString and readLines for text and readFileToByteArray for binary data.
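The no-third-party-library route for text might be sketched as follows. The names WholeFileReader and readAll are hypothetical, and the explicit Charset parameter is an assumption added here so the decoding does not depend on the platform default.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class WholeFileReader {
    // Reads the whole file into a String by looping over a BufferedReader
    // that wraps an InputStreamReader.
    static String readAll(File file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            char[] chunk = new char[8192];
            int read;
            // read(char[]) returns -1 once the end of the stream is reached.
            while ((read = reader.read(chunk)) != -1) {
                text.append(chunk, 0, read);
            }
        }
        return text.toString();
    }
}
```

A call such as WholeFileReader.readAll(file, StandardCharsets.UTF_8) returns the file's full text. Unlike the "\\Z" delimiter in the question, this keeps the final line terminator if the file has one.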
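You can use the Scanner class; just specify a charset when opening the scanner, using the Scanner(File, String) constructor, which takes the charset name as its second argument.
Java converts the bytes read from the file into characters using the specified charset, or the default one (taken from the underlying OS) if none is given (source: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html#Scanner%28java.io.File,%20java.lang.String%29). It is still not clear to me why Scanner reads only 1024 bytes with the default charset, whilst with another one it reaches the end of the file. Anyway, it works fine!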
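A hedged sketch of this approach, using the Scanner(File, String) constructor. The class and method names are hypothetical, and the charset name must match the file's real encoding. The extra ioException() check is an addition made here, not part of the answer above: the Scanner documentation states that if the underlying source throws an IOException, the scanner assumes the end of input has been reached and keeps the exception for ioException(). A decoding failure under the wrong charset could therefore look like a truncated file, although this thread does not confirm that as the cause.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

// Hypothetical helper class, for illustration only.
class CharsetScannerExample {
    // Reads the whole file with an explicitly named charset, e.g. "UTF-8".
    static String readAll(File file, String charsetName) throws IOException {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            scanner.useDelimiter("\\Z");
            // hasNext() is false for an empty file; next() would throw in that case.
            String text = scanner.hasNext() ? scanner.next() : "";
            // Scanner swallows IOExceptions from its source and treats them as
            // end of input; rethrow the last one so a short read is not silent.
            IOException swallowed = scanner.ioException();
            if (swallowed != null) {
                throw swallowed;
            }
            return text;
        }
    }
}
```

If the returned text is shorter than expected, the rethrown exception should indicate whether the underlying read failed partway through.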