Java: decompressing a file into a String is too slow
Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
    FileOutputStream fo = null;
    GZIPOutputStream gz = null;
    try {
        fo = new FileOutputStream(outFile);
        gz = new GZIPOutputStream(fo);
        gz.write(src.getBytes());
        gz.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            gz.close();
            fo.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;

public static String decompressRawText(File inFile) {
    InputStream in = null;
    InputStreamReader isr = null;
    StringBuilder sb = new StringBuilder(STRING_SIZE); // constant resizing is costly, so set the STRING_SIZE
    try {
        in = new FileInputStream(inFile);
        in = new BufferedInputStream(in, BUFFER_SIZE);
        in = new GZIPInputStream(in, BUFFER_SIZE);
        isr = new InputStreamReader(in);
        char[] cbuf = new char[BUFFER_SIZE];
        int length = 0;
        while ((length = isr.read(cbuf)) != -1) {
            sb.append(cbuf, 0, length);
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            in.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }
    }
    return sb.toString();
}
The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression part. Any idea how I could speed it up?
EDIT: I have modified the code to the above based on the recommendations given below.
1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder buffer to 2M, as suggested by entonio. Should I set it a bit larger? Memory is still OK; I still have around 10M available according to Eclipse's heap monitor.
3. I removed the BufferedReader and added a BufferedInputStream, but I am still not sure about BUFFER_SIZE. Any suggestions?
The above modifications reduced the time taken to loop over all 30 of my 2M files from almost 30 seconds to around 14, but I need to get it under 10. Is that even possible on Android? Basically, I need to process 60M of text in total; I have split it into 30 files of 2M each, and before I start processing each string, I timed just how long it takes to loop over all the files and load each file's String into memory. Since I don't have much experience, would it be better to use 60 files of 1M instead? Or is there any other improvement I should adopt? Thanks.
ALSO: Since physical IO is quite time consuming, and since my compressed files are all quite small (around 2K compressed from 2M of text), is it possible for me to still do the above, but on a file that is already mapped into memory, possibly using Java NIO? Thanks.
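Not part of the original post: a rough sketch of the memory-mapping idea asked about here, assuming each compressed file really is only a few KB so the whole mapped region can be copied to the heap before inflating. The class name, the expectedChars parameter and the UTF-8 charset are illustrative choices, not taken from the thread.

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.GZIPInputStream;

public class MappedDecompressor {

    // Hypothetical helper: memory-map the compressed file, copy it to the
    // heap, then inflate from the in-memory copy instead of from disk.
    public static String decompressMapped(File inFile, int expectedChars) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(inFile, "r");
        try {
            FileChannel ch = raf.getChannel();
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            // The compressed files are only ~2K each, so copying the mapped
            // region into a byte[] is cheap.
            byte[] compressed = new byte[(int) ch.size()];
            mapped.get(compressed);

            InputStreamReader isr = new InputStreamReader(
                    new GZIPInputStream(new ByteArrayInputStream(compressed)), "UTF-8");
            StringBuilder sb = new StringBuilder(expectedChars);
            char[] cbuf = new char[8 * 1024];
            int n;
            while ((n = isr.read(cbuf)) != -1) {
                sb.append(cbuf, 0, n);
            }
            return sb.toString();
        } finally {
            raf.close();
        }
    }
}

Whether mapping actually beats a plain buffered read for files this small is something to measure; the disk read is unlikely to dominate compared to inflation and char decoding.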
3 Answers
The BufferedReader's only purpose is the readLine() method you don't use, so why not just read from the InputStreamReader? Also, maybe decreasing the buffer size may be helpful. Also, you should probably specify the encoding while both reading and writing, though that shouldn't have an impact on performance.

edit: more data

If you know the size of the string ahead, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accommodate the result, and that's costly.

edit: clarification

2MB implies a lot of resizes. There is no harm if you specify a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
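A minimal sketch of what this answer suggests, assuming the caller knows roughly how many characters to expect; the expectedLength parameter and the UTF-8 charset are illustrative choices, not taken from the thread.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class Decompressor {

    // Variant of decompressRawText with an explicit expected length and an
    // explicit charset, as the answer suggests.
    public static String decompressRawText(File inFile, int expectedLength) throws IOException {
        InputStream in = null;
        try {
            in = new GZIPInputStream(
                    new BufferedInputStream(new FileInputStream(inFile), 8 * 1024));
            InputStreamReader isr = new InputStreamReader(in, "UTF-8");

            // Pre-size the builder so it never has to grow while reading.
            StringBuilder sb = new StringBuilder(expectedLength);
            char[] cbuf = new char[8 * 1024];
            int n;
            while ((n = isr.read(cbuf)) != -1) {
                sb.append(cbuf, 0, n);
            }
            return sb.toString();
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }
}

With roughly 2M of text per file, allocating the builder once avoids the repeated doubling and copying it would otherwise go through on its way up from the default capacity.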
You should wrap the FileInputStream with a BufferedInputStream before wrapping with a GZIPInputStream, rather than using a BufferedReader.

The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.

Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.

Edit: one nice pattern to follow when building up a bunch of stream delegates is to use a single InputStream variable. Then, you only have one thing to close in your finally block (and can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
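A small sketch of the single-variable pattern (and the Commons IO shortcut) this answer mentions, assuming commons-io is on the classpath; readCompressed is a hypothetical helper, not the poster's method.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.IOUtils;

public class SingleStreamPattern {

    // Each wrapper is assigned back to the same InputStream reference,
    // so only one object has to be closed in the finally block.
    public static byte[] readCompressed(File inFile) throws IOException {
        InputStream in = null;
        try {
            in = new FileInputStream(inFile);
            in = new BufferedInputStream(in, 8 * 1024);
            in = new GZIPInputStream(in);
            return IOUtils.toByteArray(in);   // Commons IO runs the read loop
        } finally {
            IOUtils.closeQuietly(in);         // null-safe, swallows close() errors
        }
    }
}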
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
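A sketch of the write path with the same buffering applied, as this answer recommends; the explicit UTF-8 charset is an assumption, since the original code relies on the platform default encoding.

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.GZIPOutputStream;

public class BufferedGzipWriter {

    // Mirror of the read-side advice for the write path: buffer the raw
    // FileOutputStream underneath gzip so the deflater sees large chunks.
    public static void compressRawText(File outFile, String src) throws IOException {
        Writer w = null;
        try {
            w = new OutputStreamWriter(
                    new GZIPOutputStream(
                            new BufferedOutputStream(new FileOutputStream(outFile), 8 * 1024)),
                    "UTF-8");
            w.write(src);
        } finally {
            if (w != null) {
                w.close();   // closing the writer finishes the gzip stream and the file
            }
        }
    }
}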