Reading a file with the Java Scanner

Posted on 2024-09-26 01:03:19

One of the lines in a Java file I'm trying to understand is shown below.

return new Scanner(file).useDelimiter("\\Z").next();

According to the java.util.regex.Pattern documentation, \Z matches "The end of the input but for the final terminator, if any", so this call is expected to return the whole file up to that point. What actually happens is that it returns only the first 1024 characters of the file. Is this a limitation imposed by the regex Pattern matcher? Can it be overcome? For now I'm using a FileReader instead, but I would like to know the reason for this behaviour.

Comments (4)

陌上芳菲 2024-10-03 01:03:19

I couldn't reproduce this myself, but I think I can shed some light on what is going on.

Internally, the Scanner uses a character buffer of 1024 characters. By default, the Scanner reads up to 1024 characters from your Readable, if that many are available, and then applies the pattern.

The problem is in your pattern: it will always match the end of the input, but that doesn't mean the end of your input stream/data. When Java applies your pattern to the buffered data, it tries to find the first occurrence of the end of input. Since 1024 characters are in the buffer, the matching engine treats position 1024 as the first match of the delimiter, and everything before it is returned as the first token.

I don't think the end-of-input anchor is valid for use in the Scanner for that reason. It could be reading from an infinite stream, after all.
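
Since the truncation was not reproducible for this answerer, a small experiment may help check whether it happens on a given setup. The sketch below is only that: it writes more than 1024 ASCII characters to a temporary file, reads them back with the one-liner from the question, and prints both lengths. The class name ScannerLengthCheck and the helper scannedLength are hypothetical names introduced here, and the result may depend on the JDK version and the file's encoding.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question.
class ScannerLengthCheck {
    // Returns how many characters the Scanner one-liner yields for this file.
    static int scannedLength(File file) throws FileNotFoundException {
        try (Scanner sc = new Scanner(file).useDelimiter("\\Z")) {
            return sc.hasNext() ? sc.next().length() : 0;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build ASCII content well past the 1024-character buffer size.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append((char) ('a' + i % 26));
        }
        File file = File.createTempFile("scanner-check", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), sb.toString().getBytes(StandardCharsets.US_ASCII));

        // Compare what was written with what the Scanner returned.
        System.out.println("written: " + sb.length());
        System.out.println("scanned: " + scannedLength(file));
    }
}
```

If the two numbers differ, the truncation is reproducible in that environment; if they match, the cause is more likely tied to the specific file or encoding than to the pattern alone.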

半步萧音过轻尘 2024-10-03 01:03:19

Try wrapping the file object in a FileInputStream.
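
A minimal sketch of that suggestion, assuming the same "\\Z" delimiter as in the question. StreamScan and readAll are hypothetical names used only for illustration, and the default charset is still used for decoding here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical helper class illustrating the FileInputStream wrapping.
class StreamScan {
    // Reads the file through a FileInputStream instead of passing the File directly.
    static String readAll(File file) throws IOException {
        try (InputStream in = new FileInputStream(file);
             Scanner sc = new Scanner(in).useDelimiter("\\Z")) {
            // hasNext() guards against NoSuchElementException on an empty file.
            return sc.hasNext() ? sc.next() : "";
        }
    }
}
```

Because \Z matches before a final line terminator, a trailing newline at the end of the file is not part of the returned string.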

禾厶谷欠 2024-10-03 01:03:19

Scanner is intended to read multiple primitives from a file. It really isn't intended to read an entire file.

If you don't want to include third-party libraries, you're better off looping over a BufferedReader that wraps a FileReader/InputStreamReader for text, or looping over a FileInputStream for binary data.
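
The text variant of that loop might look like the sketch below. ReaderLoop and readAll are hypothetical names, the 4096-character chunk size is an arbitrary choice, and the charset is passed in explicitly as an assumption about what the caller knows about the file.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper class illustrating the BufferedReader loop.
class ReaderLoop {
    // Reads the whole file as text by looping over a BufferedReader.
    static String readAll(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            char[] chunk = new char[4096];
            int n;
            // read() returns -1 once the end of the stream is reached.
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb.toString();
    }
}
```

Unlike the Scanner approach, this keeps every character, including a trailing line terminator, and any IOException propagates to the caller.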

If you're OK using a third-party library, Apache commons-io has a FileUtils class that contains the static methods readFileToString and readLines for text and readFileToByteArray for binary data.

七堇年 2024-10-03 01:03:19

You can use the Scanner class; just specify a charset when opening the scanner, i.e.:

Scanner sc = new Scanner(file, "ISO-8859-1");

Java converts bytes read from the file into characters using the specified charset; if none is given, the default one (from the underlying OS) is used (source: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html#Scanner%28java.io.File,%20java.lang.String%29). It is still not clear to me why Scanner reads only 1024 bytes with the default charset, whilst with another one it reaches the end of the file. Anyway, it works fine!
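
One possible explanation, which this thread does not confirm: Scanner treats an IOException from its underlying source as the end of input and records it instead of throwing it; the most recent one is available through ioException(). A decoding failure under the default charset could therefore look like silent truncation. The sketch below reads with an explicit charset and surfaces any recorded exception. CharsetScan and readAll are hypothetical names, and whether a decoding error was the cause in the asker's case is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical helper class, not part of the original answer.
class CharsetScan {
    // Reads everything from the stream using the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner sc = new Scanner(in, charsetName).useDelimiter("\\Z")) {
            String text = sc.hasNext() ? sc.next() : "";
            // Scanner records read failures rather than throwing them;
            // rethrow so a short result is not mistaken for the whole input.
            IOException recorded = sc.ioException();
            if (recorded != null) {
                throw new UncheckedIOException(recorded);
            }
            return text;
        }
    }
}
```

For a file, this could be called as CharsetScan.readAll(new FileInputStream(file), "ISO-8859-1"). ISO-8859-1 maps every byte to a character, which may be why it never fails to decode.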
