Java: decompressing a file to a String is too slow

Published 2024-11-07 09:43:43 · 2,147 characters · 0 views · 0 comments


Here is how I compressed the string into a file:

public static void compressRawText(File outFile, String src) {
    FileOutputStream fo = null;
    GZIPOutputStream gz = null;
    try {
        fo = new FileOutputStream(outFile);
        gz = new GZIPOutputStream(fo);
        gz.write(src.getBytes("UTF-8")); // specify the charset explicitly
        gz.finish();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (gz != null) {
                gz.close(); // closing gz also closes fo
            } else if (fo != null) {
                fo.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Here is how I decompressed it:

static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;
public static String decompressRawText(File inFile) {
    InputStream in = null;
    InputStreamReader isr = null;
    StringBuilder sb = new StringBuilder(STRING_SIZE); // constant resizing is costly, so pre-size to STRING_SIZE
    try {
        in = new FileInputStream(inFile);
        in = new BufferedInputStream(in, BUFFER_SIZE);
        in = new GZIPInputStream(in, BUFFER_SIZE);
        isr = new InputStreamReader(in, "UTF-8"); // specify the charset explicitly
        char[] cbuf = new char[BUFFER_SIZE];
        int length;
        while ((length = isr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (isr != null) {
                isr.close(); // also closes the wrapped streams
            } else if (in != null) {
                in.close();
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
    return sb.toString();
}

The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression part. Any idea how I could speed it up?

EDIT: I have modified the code to the above based on the recommendations given below.

1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder buffer to 2 MB, as suggested by entonio. Should I set it a bit higher? Memory is still OK; I still have around 10 MB available according to Eclipse's heap monitor.
3. I removed the BufferedReader and added a BufferedInputStream, but I am still not sure about the BUFFER_SIZE. Any suggestions?

The above modifications improved the time taken to loop over all thirty of my 2 MB files from almost 30 seconds to around 14, but I need to reduce it to under 10. Is that even possible on Android? Basically, I need to process 60 MB of text in total, which I have divided into thirty 2 MB files. Before processing each string, I timed how long it takes just to loop over all the files and load each file's String into memory. Since I don't have much experience: would it be better to use sixty 1 MB files instead, or is there any other improvement I should adopt? Thanks.

ALSO: Since physical I/O is quite time-consuming, and since the compressed versions of my files are all quite small (around 2 KB from 2 MB of text), is it possible for me to still do the above, but on a file that is already mapped to memory, possibly using Java NIO? Thanks.
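For the memory-mapping idea, a minimal sketch (the class and method names here are my own, and it assumes Java 7+ for try-with-resources) could look like the following. Note that for files this small the mapping itself may not help much, since the dominant cost is usually the inflate and char decoding rather than the disk read:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.GZIPInputStream;

public class MappedDecompress {
    // Sketch: map the compressed file into memory, copy it into a byte
    // array, then inflate from that array instead of streaming from disk.
    public static String decompressMapped(File inFile) throws IOException {
        try (FileInputStream fis = new FileInputStream(inFile);
             FileChannel ch = fis.getChannel()) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] compressed = new byte[mapped.remaining()];
            mapped.get(compressed);
            try (Reader r = new InputStreamReader(
                    new GZIPInputStream(new ByteArrayInputStream(compressed)), "UTF-8")) {
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[8 * 1024];
                int n;
                while ((n = r.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                return sb.toString();
            }
        }
    }
}
```

Whether this beats a plain BufferedInputStream is worth measuring on the target device before committing to it.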

Comments (3)

著墨染雨君画夕 2024-11-14 09:43:43


The BufferedReader's only purpose is the readLine() method, which you don't use, so why not just read from the InputStreamReader? Also, decreasing the buffer size might help. You should also specify the encoding when both reading and writing, though that shouldn't have an impact on performance.

edit: more data

If you know the size of the string ahead of time, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized to accommodate the result, and that's costly.

edit: clarification

2 MB implies a lot of resizes. There is no harm in specifying a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
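A sketch of that suggestion (the `expectedChars` parameter name is my own, and it assumes Java 7+ for try-with-resources):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.zip.GZIPInputStream;

public class SizedDecompress {
    // expectedChars is the known uncompressed length, passed in by the
    // caller so the StringBuilder never has to grow while reading.
    public static String decompressRawText(File inFile, int expectedChars) throws IOException {
        StringBuilder sb = new StringBuilder(expectedChars);
        try (Reader isr = new InputStreamReader(
                new GZIPInputStream(new BufferedInputStream(new FileInputStream(inFile), 8 * 1024)),
                "UTF-8")) {
            char[] cbuf = new char[8 * 1024];
            int n;
            while ((n = isr.read(cbuf)) != -1) {
                sb.append(cbuf, 0, n);
            }
        }
        return sb.toString();
    }
}
```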

耀眼的星火 2024-11-14 09:43:43


You should wrap the FileInputStream with a BufferedInputStream before wrapping with a GZipInputStream, rather than using a BufferedReader.

The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.

Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.


Edit: one nice pattern to follow when building up a bunch of stream delegates is to use a single InputStream variable. Then you only have one thing to close in your finally block (and you can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).

  InputStream in = null;
  try
  {
     in = new FileInputStream("foo");
     in = new BufferedInputStream(in);
     in = new GZIPInputStream(in);

     // do something with the stream
  }
  finally
  {
     IOUtils.closeQuietly(in);
  }
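If IOUtils is not available (as on Android, per the question), a null-checked close in the finally block gives the same single-variable pattern; the `countBytes` helper here is just an illustration of my own:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class SingleVarClose {
    // Same pattern as above, but without IOUtils: one InputStream variable
    // is re-wrapped at each decoration step and closed once at the end.
    public static int countBytes(File f) throws IOException {
        InputStream in = null;
        try {
            in = new FileInputStream(f);
            in = new BufferedInputStream(in);
            in = new GZIPInputStream(in);
            int total = 0;
            byte[] buf = new byte[8 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            return total;
        } finally {
            if (in != null) {
                try {
                    in.close(); // closes the whole chain
                } catch (IOException ignored) {
                }
            }
        }
    }
}
```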
听闻余生 2024-11-14 09:43:43


Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.

Similarly when writing.
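A sketch of that wrapping on both the read and write sides (the class name and the 8 KB buffer size are arbitrary choices of mine):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BufferedGzip {
    // Buffer between the file stream and the GZIP stream so the
    // deflater/inflater sees large chunks instead of tiny reads/writes.
    public static OutputStream openCompressed(File f) throws IOException {
        return new GZIPOutputStream(new BufferedOutputStream(new FileOutputStream(f), 8 * 1024));
    }

    public static InputStream openDecompressed(File f) throws IOException {
        return new GZIPInputStream(new BufferedInputStream(new FileInputStream(f), 8 * 1024));
    }
}
```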
