Java: reading strings from a random access file with buffered input
I've never worked closely with the Java IO API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it can be to do a simple task.
My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read the lines between these two bytes (including the starting one, not including the ending one) and use them as UTF-8 String objects.
For example, in most scripting languages it would be a very simple 1-2-3-liner like this (in Ruby, but it would be essentially the same in Python, Perl, etc.):
f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end
It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:
- RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() returns neither UTF-8 strings nor byte arrays, only very strange strings with broken encoding, and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read(), which is fairly slow).
- BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to report even the number of bytes that have already been read, let alone the current position in the file.
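For reference, the "strange strings" from RandomAccessFile.readLine() can be repaired: the method turns each byte into one char, which is the same as decoding the line as ISO-8859-1. The sketch below re-encodes each line and decodes it as UTF-8. The class and method names (RafLines, readLines) are made up for illustration, and every read is still unbuffered, so this only addresses the encoding complaint, not the speed one.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class RafLines {
    // Returns the lines that start in [pos1, pos2), decoded as UTF-8.
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            while (raf.getFilePointer() < pos2) {
                String raw = raf.readLine();      // one char per byte, line terminator removed
                if (raw == null) {
                    break;                        // end of file reached before pos2
                }
                // Recover the original bytes, then decode them properly.
                byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
                lines.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }
}
```

This assumes pos1 falls on the first byte of a line; a position in the middle of a multi-byte character would produce a replacement character at the start of the first line.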
I've tried to use something like:
FileInputStream fis = new FileInputStream(fileName);
FileChannel fc = fis.getChannel();
BufferedReader br = new BufferedReader(
    new InputStreamReader(
        fis,
        CHARSET_UTF8
    )
);
... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set one, but it doesn't seem to work in my case: it looks like it returns the position of the buffer pre-fill done by BufferedReader, or something like that. These counters seem to be rounded up in 16K increments.
Do I really have to implement it all myself, i.e. a file-reading interface which would:
- allow me to get/set the position in a file
- buffer file reading operations
- allow reading UTF-8 strings (or at least allow operations like "read everything until the next \n")
Is there a quicker way than implementing it all myself? Am I overlooking something?
Comments (7)
If you didn't care about pos2, then you wouldn't need Apache Commons IO.

I wrote this code to read UTF-8 using a RandomAccessFile.
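The code from that answer is not reproduced here. As a rough illustration of the idea, and not the answerer's actual implementation, a reader that buffers a RandomAccessFile, tracks the byte position itself, and decodes each line as UTF-8 could look like the sketch below. The class name BufferedRafReader and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical class: buffered, seekable, UTF-8 line reader.
class BufferedRafReader implements Closeable {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private int bufPos = 0;    // index of the next unread byte in buf
    private int bufLen = 0;    // number of valid bytes in buf
    private long bufStart = 0; // file offset of buf[0]

    BufferedRafReader(String fileName) throws IOException {
        raf = new RandomAccessFile(fileName, "r");
    }

    // Byte offset of the next byte readLine() will consume.
    long position() {
        return bufStart + bufPos;
    }

    void seek(long pos) throws IOException {
        raf.seek(pos);
        bufStart = pos;
        bufPos = 0;
        bufLen = 0;            // drop whatever was buffered
    }

    // Returns the next line without its terminator, or null at end of file.
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        while (true) {
            if (bufPos == bufLen) {                  // buffer exhausted: refill
                bufStart += bufLen;
                bufPos = 0;
                bufLen = Math.max(raf.read(buf), 0); // read() returns -1 at EOF
                if (bufLen == 0) {
                    break;
                }
            }
            sawAny = true;
            byte b = buf[bufPos++];
            if (b == '\n') {
                break;
            }
            line.write(b);
        }
        if (!sawAny) {
            return null;
        }
        String s = new String(line.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }
}
```

Splitting on the raw byte 0x0A is safe for UTF-8 because that byte never occurs inside a multi-byte sequence. The loop from the question then becomes: seek(pos1), and call readLine() while position() < pos2.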
For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started a tutorial on it.
Also note that this isn't using Java 7's new ARM syntax (which takes care of the exception handling for file-based resources), because it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.
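The Java 7 code from this answer is not reproduced here. A minimal sketch of what an NIO.2 version might look like follows; unlike the answer, it does use the ARM (try-with-resources) syntax. The class and method names (NioRange, readRange) are assumptions for illustration, and pos1 and pos2 are assumed to fall on character boundaries.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not the answerer's code.
class NioRange {
    // Reads bytes [pos1, pos2) through a channel and decodes them as UTF-8.
    static String readRange(Path file, long pos1, long pos2) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate((int) (pos2 - pos1));
        // try-with-resources closes the channel even if a read fails.
        try (SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ)) {
            channel.position(pos1);
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or the file ends
            }
        }
        buffer.flip();
        return StandardCharsets.UTF_8.decode(buffer).toString();
    }
}
```

The returned string can then be split on "\n" to get the individual lines. Like any preload approach, it holds the whole range in memory at once.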

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.
Then create your BufferedReader over rawBytes (for example, by wrapping it in a ByteArrayInputStream and an InputStreamReader).
Then you can call readLine on the BufferedReader.
Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
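The preload approach described in this answer can be sketched as follows. The variable name rawBytes comes from the answer; the class and method names (PreloadedRange, readLines) and the explicit UTF-8 charset are assumptions added for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the preload approach.
class PreloadedRange {
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)];
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            raf.readFully(rawBytes); // throws EOFException if the file ends before pos2
        }
        List<String> lines = new ArrayList<>();
        // Decode the in-memory bytes as UTF-8 and split them into lines.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Only one seek and one bulk read hit the file, so the missing buffering in RandomAccessFile stops mattering; the cost is the pos2 - pos1 bytes held in memory.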

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.
UTF-8 doesn't use a fixed number of bytes per character. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would mean 412 characters. But if the string used only double-byte characters, you would get 206 characters.
The original java.io package didn't deal well with this multi-byte confusion, so more classes were added to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.
Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass over X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.
If you assume your "bytes" are individual characters, try skipping pos1 characters and then reading pos2 - pos1 characters.
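The code from this answer is not reproduced here. Under the answer's own assumption that every character is a single byte, the skip-then-read idea might be sketched as below. Two things here are swapped in and are not from the answer: Files.newBufferedReader is used instead of FileReader so that the charset is pinned to UTF-8, and the names ReaderSkip and readChars are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; positions are counted in characters, not bytes.
class ReaderSkip {
    static String readChars(Path file, long pos1, long pos2) throws IOException {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long toSkip = pos1;
            while (toSkip > 0) {
                long skipped = reader.skip(toSkip); // may skip fewer chars than requested
                if (skipped <= 0) {
                    break;                          // end of file
                }
                toSkip -= skipped;
            }
            char[] buf = new char[(int) (pos2 - pos1)];
            int total = 0;
            while (total < buf.length) {
                int n = reader.read(buf, total, buf.length - total);
                if (n < 0) {
                    break;                          // end of file
                }
                total += n;
            }
            return new String(buf, 0, total);
        }
    }
}
```

Note that skip() and read() count characters, so as soon as the file contains a multi-byte character the offsets no longer match the byte positions from the question.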

I'm late to the party here, but I ran across this problem in my own project.
After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.
After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, wrap it in a BufferedReader, here called br.
Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().
The one thing I'm not sure about is whether UTF-8 strings are handled correctly.
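The snippet from this answer is not reproduced here. One way to get a BufferedReader that starts at the current position of raFile is to share its file descriptor with a FileInputStream, as in the sketch below. This is an assumption about what the answer did, not a quote; the names FdReader and openAt are hypothetical, and the charset is passed explicitly so that UTF-8 handling does not depend on the platform default.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the answerer's code.
class FdReader {
    // Returns a buffered UTF-8 reader that starts at byte offset pos of raFile.
    static BufferedReader openAt(RandomAccessFile raFile, long pos) throws IOException {
        raFile.seek(pos);
        // The stream shares raFile's descriptor, and with it the file offset.
        FileInputStream in = new FileInputStream(raFile.getFD());
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

Because the reader fills its buffer ahead of the lines it hands out, raFile.getFilePointer() will run ahead of the text actually consumed, which is the same effect the question observed with fc.position(). Closing either the reader or raFile closes the shared descriptor.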

The Java IO API is very flexible. Unfortunately, that flexibility sometimes makes it verbose. The main idea is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same goes for output streams.
The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.
Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file, a line or two of code is enough.
It is not as complicated as you fear.
Channels are something different. They are part of the so-called "new IO", or NIO. New IO can be non-blocking, which is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.
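As an illustration of the wrapper pattern and the convenient constructors mentioned in this answer, reading a whole file line by line takes a single wrapped constructor call. The names SimpleRead and readAllLines are made up for this sketch, and it does not solve the seeking problem from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing one reader wrapped in another.
class SimpleRead {
    static List<String> readAllLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<>();
        // FileReader(String) opens the file and decodes it with the default charset;
        // BufferedReader adds buffering and readLine() on top of it.
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

To force UTF-8 regardless of the platform default, the inner reader can be replaced with new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8).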