Java: reading strings from a random access file with buffered input

Posted on 2024-10-04 23:43:57

I've never worked closely with the Java IO API before, and I'm really frustrated now. I find it hard to believe how strange and complex it is and how hard it can be to do a simple task.

My task: I have 2 positions (starting byte, ending byte), pos1 and pos2. I need to read lines between these two bytes (including the starting one, not including the ending one) and use them as UTF8 String objects.

For example, in most script languages it would be a very simple 1-2-3-liner like that (in Ruby, but it will be essentially the same for Python, Perl, etc):

f = File.open("file.txt")
f.seek(pos1)
while f.pos < pos2
  s = f.readline
  # do something with "s" here
end

It quickly becomes hell with the Java IO APIs ;) In fact, I see two ways to read lines (ending with \n) from regular local files:

RandomAccessFile has getFilePointer() and seek(long pos), but its readLine() does not return UTF-8 strings (or even byte arrays); it returns very strange strings with broken encoding, and it has no buffering (which probably means that every read*() call is translated into a single underlying OS read() => fairly slow).
BufferedReader has a great readLine() method, and it can even do some seeking with skip(long n), but it has no way to report even the number of bytes that have already been read, let alone the current position in the file.
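
On the "broken encoding" point: RandomAccessFile.readLine() is documented to turn each byte into one char with the high eight bits set to zero, which is the same mapping as ISO-8859-1. A minimal sketch, assuming the file really is UTF-8, of how such a line could be re-decoded (the class and method names here are made up for illustration, and this does nothing about the missing buffering):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RafLineDemo {
    /** Re-decodes a string returned by RandomAccessFile.readLine() as UTF-8. */
    static String fixEncoding(String rawLine) {
        // readLine() widens each byte to one char, which matches ISO-8859-1,
        // so encoding with ISO-8859-1 recovers the original bytes exactly.
        byte[] bytes = rawLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Reads the line starting at byte offset pos and returns it decoded as UTF-8. */
    static String readUtf8LineAt(Path file, long pos) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(pos);
            String raw = raf.readLine();
            return raw == null ? null : fixEncoding(raw);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("raf-demo", ".txt");
        try {
            // "h\u00e9llo\n" occupies 7 bytes in UTF-8, so the second line starts at offset 7.
            Files.write(tmp, "h\u00e9llo\nw\u00f6rld\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(readUtf8LineAt(tmp, 0).equals("h\u00e9llo"));
            System.out.println(readUtf8LineAt(tmp, 7).equals("w\u00f6rld"));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

This keeps getFilePointer() accurate, at the cost of one unbuffered read per byte inside readLine().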

I've tried to use something like:

    FileInputStream fis = new FileInputStream(fileName);
    FileChannel fc = fis.getChannel();
    BufferedReader br = new BufferedReader(
            new InputStreamReader(
                    fis,
                    CHARSET_UTF8
            )
    );

... and then using fc.position() to get the current file reading position and fc.position(newPosition) to set it, but that doesn't seem to work in my case: it looks like it returns the position reached by the buffer pre-filling done by BufferedReader, or something like that; these counters seem to be rounded up in 16K increments.
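
The effect described here can be reproduced with a small experiment. The sketch below (class and method names are hypothetical) reads one short line through a BufferedReader and then asks the channel for its position. The exact read-ahead size is an implementation detail of the JDK, so the code only reports the two numbers rather than assuming a particular chunk size:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChannelPositionDemo {
    /** Returns { bytes in the first line including '\n', channel position after reading it }. */
    static long[] probe(Path file) throws IOException {
        try (FileInputStream fis = new FileInputStream(file.toFile())) {
            FileChannel fc = fis.getChannel();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(fis, StandardCharsets.UTF_8));
            String line = br.readLine();
            long lineBytes = line.getBytes(StandardCharsets.UTF_8).length + 1;
            // The reader has already pulled a whole chunk from the stream,
            // so the channel position lies past the end of the line.
            return new long[] { lineBytes, fc.position() };
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pos-demo", ".txt");
        try {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                sb.append("line ").append(i).append('\n');
            }
            Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
            long[] r = probe(tmp);
            System.out.println("line bytes: " + r[0] + ", channel position: " + r[1]);
        } finally {
            Files.delete(tmp);
        }
    }
}
```

The channel only knows how many bytes have been pulled from the file, not how many of them the reader has handed out as lines.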

Do I really have to implement it all by myself, i.e. a file reading interface which would:

allow me to get/set the position in a file
buffer file reading operations
allow reading UTF-8 strings (or at least allow operations like "read everything till the next \n")

Is there a quicker way than implementing it all myself? Am I overlooking something?
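
For illustration, the three requirements listed above can be met with a fairly small class. This is only a sketch under stated assumptions, not a definitive implementation: PositionedLineReader and its methods are hypothetical names, lines are assumed to end with \n, and the position is tracked by counting the bytes handed out by a BufferedInputStream. Splitting on the byte 0x0A is safe because that byte never occurs inside a multi-byte UTF-8 sequence.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: buffered, position-aware reader of '\n'-terminated UTF-8 lines. */
public class PositionedLineReader implements AutoCloseable {
    private final InputStream in;
    private long pos; // byte offset of the next unread byte

    public PositionedLineReader(String fileName, long startPos) throws IOException {
        FileInputStream fis = new FileInputStream(fileName);
        try {
            fis.getChannel().position(startPos); // seek before any buffering starts
        } catch (IOException e) {
            fis.close();
            throw e;
        }
        this.in = new BufferedInputStream(fis);
        this.pos = startPos;
    }

    public long position() {
        return pos;
    }

    /** Returns the next line without its '\n', or null at end of file. */
    public String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            sawAny = true;
            if (b == '\n') {
                break;
            }
            line.write(b);
        }
        if (!sawAny) {
            return null;
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Reads every line that starts at or after pos1 and before pos2. */
    public static List<String> readLines(String fileName, long pos1, long pos2)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (PositionedLineReader r = new PositionedLineReader(fileName, pos1)) {
            String s;
            while (r.position() < pos2 && (s = r.readLine()) != null) {
                lines.add(s);
            }
        }
        return lines;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PositionedLineReader <file> <pos1> <pos2>
        for (String s : readLines(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]))) {
            System.out.println(s);
        }
    }
}
```

Setting a new position is done here by opening a new reader. A seek method that repositions the channel and discards the buffer would be a natural extension.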

紫竹語嫣☆ 2024-10-11 23:43:57

import org.apache.commons.io.input.BoundedInputStream;

FileInputStream file = new FileInputStream(filename);
file.skip(pos1);  // move to the starting byte
BufferedReader br = new BufferedReader(
   // limit the stream to pos2-pos1 bytes and decode them explicitly as UTF-8
   new InputStreamReader(new BoundedInputStream(file, pos2 - pos1), "UTF-8")
);

If you didn't care about pos2, then you wouldn't need Apache Commons IO.
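
If adding Commons IO is not an option, the same idea can be sketched with the JDK alone. LimitedInputStream below is a hypothetical stand-in for BoundedInputStream, not part of any library, and positioning is done through the stream's channel instead of skip(). Note that this approach cuts the input at exactly pos2, so a line that straddles pos2 is returned truncated.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BoundedRead {
    /** Hypothetical stand-in for a bounded stream: reports end of stream after limit bytes. */
    static final class LimitedInputStream extends FilterInputStream {
        private long remaining;

        LimitedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = super.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }
    }

    /** Reads the bytes in [pos1, pos2) and splits them into UTF-8 lines. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        List<String> lines = new ArrayList<>();
        try (FileInputStream file = new FileInputStream(fileName)) {
            file.getChannel().position(pos1); // also moves the stream's read position
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new LimitedInputStream(file, pos2 - pos1), StandardCharsets.UTF_8));
            String s;
            while ((s = br.readLine()) != null) {
                lines.add(s);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BoundedRead <file> <pos1> <pos2>
        for (String s : readLines(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]))) {
            System.out.println(s);
        }
    }
}
```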

剑心龙吟 2024-10-11 23:43:57

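I wrote this code to read UTF-8 using RandomAccessFile:

//File: CyclicBuffer.java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class CyclicBuffer {
private static final int size = 8192; // read the file in 8 KB chunks
private FileChannel channel;
private ByteBuffer buffer = ByteBuffer.allocate(size);

public CyclicBuffer(FileChannel channel) {
    this.channel = channel;
    // A freshly allocated ByteBuffer reports hasRemaining() == true; mark it
    // empty so the first get() reads from the channel instead of returning zeros.
    buffer.limit(0);
}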

private int read() throws IOException {
    return channel.read(buffer);
}

/**
 * Returns the byte read
 *
 * @return byte read -1 - end of file reached
 * @throws IOException
 */
public byte get() throws IOException {
    if (buffer.hasRemaining()) {
        return buffer.get();
    } else {
        buffer.clear();
        int eof = read();
        if (eof == -1) {
            return (byte) eof;
        }
        buffer.flip();
        return buffer.get();
    }
}
}
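//File: UTFRandomFileLineReader.java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class UTFRandomFileLineReader {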
private final Charset charset = Charset.forName("utf-8");
private CyclicBuffer buffer;
private ByteBuffer temp = ByteBuffer.allocate(4096);
private boolean eof = false;

public UTFRandomFileLineReader(FileChannel channel) {
    this.buffer = new CyclicBuffer(channel);
}

public String readLine() throws IOException {
    if (eof) {
        return null;
    }
    byte x = 0;
    temp.clear();

    while ((byte) -1 != (x = (buffer.get())) && x != '\n') {
        if (temp.position() == temp.capacity()) {
            temp = addCapacity(temp);
        }
        temp.put(x);
    }
    if (x == -1) {
        eof = true;
    }
    temp.flip();
    // Return null only at end of file with nothing buffered; an empty line is "".
    if (eof && !temp.hasRemaining()) {
        return null;
    }
    return charset.decode(temp).toString();
}

private ByteBuffer addCapacity(ByteBuffer temp) {
    ByteBuffer t = ByteBuffer.allocate(temp.capacity() + 1024);
    temp.flip();
    t.put(temp);
    return t;
}

public static void main(String[] args) throws IOException {
    RandomAccessFile file = new RandomAccessFile("/Users/sachins/utf8.txt",
            "r");
    UTFRandomFileLineReader reader = new UTFRandomFileLineReader(file
            .getChannel());
    int i = 1;
    while (true) {
        String s = reader.readLine();
        if (s == null)
            break;
        System.out.println("\n line  " + i++);
        s = s + "\n";
        for (byte b : s.getBytes(Charset.forName("utf-8"))) {
            System.out.printf("%x", b);
        }
        System.out.printf("\n");

    }
}
}

心在旅行 2024-10-11 23:43:57

For @Ken Bloom: a very quick go at a Java 7 version. Note: I don't think this is the most efficient way; I'm still getting my head around NIO.2. Oracle has started their tutorial here

Also note that this isn't using Java 7's new ARM syntax (which takes care of the exception handling for file-based resources); it wasn't working in the latest OpenJDK build that I have. But if people want to see the syntax, let me know.

/* 
 * Paths uses the default file system; note that no exception is thrown at this
 * stage if the file is missing
 */
Path file = Paths.get("C:/Projects/timesheet.txt");
ByteBuffer readBuffer = ByteBuffer.allocate(readBufferSize);
SeekableByteChannel fc = null;
try
{
    /*
     * Files.newByteChannel returns a SeekableByteChannel. For asynchronous file
     * I/O there is a separate AsynchronousFileChannel, which multiple threads
     * can read and write simultaneously.
     */
    fc = Files.newByteChannel(file, StandardOpenOption.READ);
    fc.position(startPosition);
    while (fc.read(readBuffer) != -1)
    {
        // flip() switches the buffer from being filled to being drained
        readBuffer.flip();
        // Caveat: a multi-byte character split across two reads decodes incorrectly
        System.out.println(Charset.forName(encoding).decode(readBuffer));
        // clear() makes the whole buffer available for the next read
        readBuffer.clear();
    }
}
finally
{
    if (fc != null)
    {
        fc.close();
    }
}

薄情伤 2024-10-11 23:43:57

Start with a RandomAccessFile and use read or readFully to get a byte array between pos1 and pos2. Let's say that we've stored the data read in a variable named rawBytes.

Then create your BufferedReader using

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(rawBytes)))

Then you can call readLine on the BufferedReader.

Caveat: this probably uses more memory than if you could make the BufferedReader seek to the right location itself, because it preloads everything into memory.
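
Put together, the steps in this answer might look like the sketch below. The class name SliceLines is made up, the charset is passed explicitly as UTF-8 (the answer's one-liner would use the platform default), and the range pos2 - pos1 is assumed to fit in an int because it is loaded into a single array.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SliceLines {
    /** Loads the bytes in [pos1, pos2) into memory and splits them into UTF-8 lines. */
    static List<String> readLines(String fileName, long pos1, long pos2) throws IOException {
        byte[] rawBytes = new byte[(int) (pos2 - pos1)]; // assumes the range fits in an int
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(pos1);
            raf.readFully(rawBytes); // throws EOFException if the file ends before pos2
        }
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String s;
            while ((s = br.readLine()) != null) {
                lines.add(s);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SliceLines <file> <pos1> <pos2>
        for (String s : readLines(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]))) {
            System.out.println(s);
        }
    }
}
```

readFully() issues one bulk read, so the lack of buffering in RandomAccessFile does not matter here.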

再见回来 2024-10-11 23:43:57

I think the confusion is caused by the UTF-8 encoding and the possibility of multi-byte characters.

UTF-8 doesn't use a fixed number of bytes per character; a single character takes between one and four bytes. I'm assuming from your post that you are using single-byte characters. For example, 412 bytes would mean 412 characters. But if the string were using double-byte characters, you would get only 206 characters.
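
As a quick check of the byte/character distinction, the following sketch (the class name and sample strings are chosen only for illustration) compares the number of Unicode code points in a string with the number of bytes its UTF-8 encoding occupies:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Escapes keep the source ASCII-only: e-acute, three CJK characters, one emoji.
        String[] samples = { "abc", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(codePoints(s) + " code points, " + utf8Length(s) + " bytes");
        }
    }
}
```

This is why a byte offset such as pos1 cannot be passed to a method that counts characters unless the text is known to be pure ASCII.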

The original java.io package didn't deal well with this multi-byte confusion. So, they added more classes to deal specifically with strings. The package mixes two different types of file handlers (and they can be confusing until the nomenclature is sorted out). The stream classes provide for direct data I/O without any conversion. The reader classes convert files to strings with full support for multi-byte characters. That might help clarify part of the problem.

Since you state you are using UTF-8 characters, you want the reader classes. In this case, I suggest FileReader. The skip() method in FileReader allows you to pass by X characters and then start reading text. Alternatively, I prefer the overloaded read() method since it allows you to grab all the text at one time.

If you assume your "bytes" are individual characters, try something like this:

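FileReader fr = new FileReader( new File("x.txt") );
fr.skip( pos );  // skip() counts characters, not bytes
char[] buffer = new char[ pos2 - pos ];
fr.read( buffer, 0, buffer.length );  // the second argument is an offset into buffer, not into the file
...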

埖埖迣鎅 2024-10-11 23:43:57

I'm late to the party here, but I ran across this problem in my own project.

After much traversal of Javadocs and Stack Overflow, I think I found a simple solution.

After seeking to the appropriate place in your RandomAccessFile, which I am here calling raFile, do the following:

FileDescriptor fd = raFile.getFD();
FileReader     fr = new FileReader(fd);
BufferedReader br = new BufferedReader(fr);

Then you should be able to call br.readLine() to your heart's content, which will be much faster than calling raFile.readLine().

The one thing I'm not sure about is whether UTF8 strings are handled correctly.
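
On the UTF-8 question: FileReader(FileDescriptor) decodes with the platform default charset, so the result depends on the machine. A sketch of the same idea with the charset named explicitly, by wrapping a FileInputStream built on the same descriptor in an InputStreamReader (FdReaderDemo and readLineAt are hypothetical names):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FdReaderDemo {
    /** Seeks to pos and reads one UTF-8 line through a buffered reader on the same descriptor. */
    static String readLineAt(String fileName, long pos) throws IOException {
        try (RandomAccessFile raFile = new RandomAccessFile(fileName, "r")) {
            raFile.seek(pos);
            // The stream shares raFile's descriptor, so it starts at the seeked position.
            BufferedReader br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(raFile.getFD()), StandardCharsets.UTF_8));
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FdReaderDemo <file> <pos>
        System.out.println(readLineAt(args[0], Long.parseLong(args[1])));
    }
}
```

One caveat: the BufferedReader reads ahead, so after the first readLine() the value of raFile.getFilePointer() no longer corresponds to the end of the last line returned.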

皇甫轩 2024-10-11 23:43:57

The Java IO API is very flexible. Unfortunately, sometimes that flexibility makes it verbose. The main idea here is that there are many streams, writers and readers that implement the wrapper pattern. For example, BufferedInputStream wraps any other InputStream. The same goes for output streams.

The difference between streams and readers/writers is that streams work with bytes while readers/writers work with characters.

Fortunately, some streams, writers and readers have convenient constructors that simplify coding. If you want to read a file you just have to say

    InputStream in = new FileInputStream("/usr/home/me/myfile.txt");
    in.skip(1024);       // skip the first 1024 bytes (skip() may skip fewer; check its return value)
    int b = in.read();   // read the next byte, or -1 at end of file

It is not as complicated as you fear.

Channels are something different. They are part of the so-called "new IO", or NIO. New IO can be non-blocking, which is its main advantage. You can search the internet for any "nio java tutorial" and read about it. But it is more complicated than regular IO and is not needed for most applications.
