How can I get Java to use my multi-core processor with GZIPInputStream?
I'm using a GZIPInputStream in my program, and I know that the performance would be helped if I could get Java running my program in parallel.
In general, is there a command-line option for the standard VM to run on many cores? It's running on just one as it is.
Thanks!
Edit
I'm running plain ol' Java SE 6 update 17 on Windows XP.
Would putting the GZIPInputStream on a separate thread explicitly help? No! Do not put the GZIPInputStream on a separate thread! Do NOT multithread I/O!
Edit 2
I suppose I/O is the bottleneck, as I'm reading and writing to the same disk...
In general, though, is there a way to make GZIPInputStream faster? Or a replacement for GZIPInputStream that runs parallel?
Edit 3
Code snippet I used:
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(INPUT_FILENAME));
DataInputStream in = new DataInputStream(new BufferedInputStream(gzip));
Comments (9)
AFAIK, reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of I/O (e.g., if you are reading two very large files in two different areas of the hard disk).
More generally (assuming this question comes from someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it which units of work you want done and how to synchronize between them. Java (with the help of the OS) will generally use as many cores as are available to it, and will also swap threads on the same core if there are more threads than cores (which is typically the case).
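To make the "one thread per file" idea concrete, here is a minimal sketch, assuming Java 8 or later (so not the asker's Java 6) and a hypothetical helper that merely counts decompressed bytes in place of real work. The class and method names are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

final class ParallelGunzip {

    // Decompresses one .gz file and returns the number of decompressed bytes.
    // (Stand-in for whatever real per-file work the program does.)
    static long countDecompressedBytes(String path) throws IOException {
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            byte[] buffer = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        }
    }

    // Submits one task per file to a fixed pool sized to the number of cores.
    // Results come back in the same order as the input paths.
    static List<Long> countAll(List<String> paths) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String path : paths) {
                futures.add(pool.submit(() -> countDecompressedBytes(path)));
            }
            List<Long> sizes = new ArrayList<>();
            for (Future<Long> future : futures) {
                sizes.add(future.get()); // blocks until that file is done
            }
            return sizes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countAll(Arrays.asList(args)));
    }
}
```

Whether this is actually faster depends on the disk: if all the files sit on one spinning drive, the threads may simply compete for the same head, as noted above.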
PIGZ (Parallel Implementation of GZip) is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data: http://www.zlib.net/pigz/ It's not in Java yet. Any takers? Of course the world needs it in Java.
Sometimes compression or decompression is a big CPU consumer, though it helps keep I/O from being the bottleneck.
See also Dataseries (C++) from HP Labs. PIGZ only parallelizes compression, while Dataseries breaks the output into large compressed blocks that can be decompressed in parallel. It also has a number of other features.
Wrap your GZIP streams in Buffered streams; this should give you a significant performance increase.
And likewise for the input stream. Using buffered input/output streams reduces the number of disk reads.
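As a rough sketch of what that wrapping could look like: the buffer sits between the file and the GZIP stream (so the decompressor pulls from memory rather than from the disk), and optionally again on the outside for many small DataInputStream reads. The 64 KB size and the class and method names are assumptions for illustration, not tuned values; the two-argument GZIPInputStream and GZIPOutputStream constructors take the size of their internal buffer.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BufferedGzip {

    // Illustrative size only; measure before settling on a value.
    static final int BUFFER_SIZE = 64 * 1024;

    // file -> buffer -> gunzip -> buffer -> DataInputStream
    static DataInputStream openForReading(String path) throws IOException {
        InputStream file = new BufferedInputStream(new FileInputStream(path), BUFFER_SIZE);
        InputStream gzip = new GZIPInputStream(file, BUFFER_SIZE);
        return new DataInputStream(new BufferedInputStream(gzip, BUFFER_SIZE));
    }

    // DataOutputStream -> buffer -> gzip -> buffer -> file
    static DataOutputStream openForWriting(String path) throws IOException {
        OutputStream file = new BufferedOutputStream(new FileOutputStream(path), BUFFER_SIZE);
        OutputStream gzip = new GZIPOutputStream(file, BUFFER_SIZE);
        return new DataOutputStream(new BufferedOutputStream(gzip, BUFFER_SIZE));
    }
}
```

Closing the outermost stream closes the whole chain, which also makes GZIPOutputStream write its trailer.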
I'm not seeing any answer that addresses the other processing in your program.
If you're just unzipping a file, you'd be better off simply using the command-line gunzip tool; but there's likely some processing happening with the files you're pulling out of that stream. If you're extracting something that comes in reasonably sized chunks, then your processing of those chunks should happen in a separate thread from the unzipping.
You could manually start a Thread on each large String or other block of data; but since Java 5 you'd be better off with one of the fancy new classes in java.util.concurrent, such as a ThreadPoolExecutor.
Update
It's not clear to me from the question and other comments whether you really ARE just extracting files using Java. If you really, really think you should try to compete with gunzip, then you can probably gain some performance by using large buffers; e.g. work with a buffer of, say, 10 MB (binary, not decimal: 10 × 1,048,576 = 10,485,760 bytes), fill it in a single gulp and write it to disk likewise. That will give your OS a chance to do some medium-scale planning for disk space, and you'll need fewer system-level calls too.
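A minimal sketch of the "decompress on one thread, process chunks on a pool" idea, assuming Java 8 or later. The per-chunk work here (counting newline bytes) is a hypothetical stand-in for the real processing, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

final class ChunkedProcessing {

    // Hypothetical per-chunk work: count '\n' bytes in the first `length` bytes.
    static int countNewlines(byte[] chunk, int length) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            if (chunk[i] == '\n') count++;
        }
        return count;
    }

    // Reads until the chunk is full or the stream ends; returns bytes read.
    static int fill(InputStream in, byte[] chunk) throws IOException {
        int filled = 0;
        while (filled < chunk.length) {
            int n = in.read(chunk, filled, chunk.length - filled);
            if (n == -1) break;
            filled += n;
        }
        return filled;
    }

    // The calling thread decompresses; pool threads process the chunks.
    static long countLines(InputStream compressed, int chunkSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (InputStream in = new GZIPInputStream(compressed, 64 * 1024)) {
            List<Future<Integer>> results = new ArrayList<>();
            while (true) {
                byte[] chunk = new byte[chunkSize]; // fresh array per task
                int length = fill(in, chunk);
                if (length == 0) break;
                results.add(pool.submit(() -> countNewlines(chunk, length)));
                if (length < chunkSize) break; // short chunk means end of stream
            }
            long total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that this sketch puts no limit on how many chunks are waiting in the pool's queue, so a fast decompressor feeding slow processing could use a lot of memory; a bounded queue would address that.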
Compression seems like a hard case for parallelization because the bytes emitted by the compressor are a non-trivial function of the previous W bytes of input, where W is the window size. You can obviously break a file into pieces and create independent compression streams for each of the pieces, each running in its own thread. You may need to retain some compression metadata so the decompressor knows how to put the file back together.
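Here is one way the "independent streams per piece" approach might look, as a sketch assuming Java 8 or later and data that fits in memory. Each block is compressed as its own complete gzip stream, and the list of blocks plays the role of the metadata: it records where each piece starts and ends. On disk you would have to store the block lengths yourself. All names here are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BlockGzip {

    // Compresses each blockSize slice of data as an independent gzip stream.
    static List<byte[]> compressBlocks(byte[] data, int blockSize, ExecutorService pool)
            throws Exception {
        List<Future<byte[]>> futures = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += blockSize) {
            final int start = offset;
            final int length = Math.min(blockSize, data.length - offset);
            futures.add(pool.submit(() -> gzip(data, start, length)));
        }
        List<byte[]> blocks = new ArrayList<>();
        for (Future<byte[]> future : futures) {
            blocks.add(future.get()); // preserves the original block order
        }
        return blocks;
    }

    // Decompresses the blocks in parallel and joins them in order.
    static byte[] decompressBlocks(List<byte[]> blocks, ExecutorService pool)
            throws Exception {
        List<Future<byte[]>> futures = new ArrayList<>();
        for (byte[] block : blocks) {
            futures.add(pool.submit(() -> gunzip(block)));
        }
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (Future<byte[]> future : futures) {
            joined.write(future.get());
        }
        return joined.toByteArray();
    }

    static byte[] gzip(byte[] data, int offset, int length) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data, offset, length);
        }
        return bytes.toByteArray();
    }

    static byte[] gunzip(byte[] block) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        }
        return bytes.toByteArray();
    }
}
```

The trade-off is a slightly worse compression ratio, since each block starts with an empty window and cannot refer back to earlier data.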
Compression and decompression using gzip is a serial process. To use multiple threads, you would have to write a custom program to break up the input file into many streams, and then a custom program to decompress and join them back together. Either way, I/O is going to be a bottleneck WAY before CPU usage is.
Run multiple VMs. Each VM is a process and you should be able to run at least three processes per core without suffering any drop in performance. Of course, your application would have to be able to leverage multiprocessing in order to benefit. There is no magic bullet, which is why you see articles in the press moaning about how we don't yet know how to use multicore machines.
However, there are lots of people out there who have structured their applications into a master which manages a pool of worker processes and parcels out work packages to them. Not all problems are amenable to being solved this way.
I think it is a mistake to assume that multithreaded I/O is always evil. You probably need to profile your particular case to be sure, because:
You may need to tune your read buffer to make it large enough to reduce the switching costs. In the boundary case, one can read all files into memory and decompress them there in parallel, which is faster and loses nothing to multithreaded I/O. However, something less extreme may also work better.
You also do not need to do anything special to use multiple available cores on the JRE. Different threads will normally use different cores, as managed by the operating system.
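The boundary case could be sketched roughly like this, assuming Java 8 or later and that all the compressed and decompressed data fits on the heap. Disk reads happen one after another on the calling thread; only the CPU-bound decompression is spread over the pool. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;

final class InMemoryGunzip {

    static List<byte[]> decompressAll(List<Path> files) throws Exception {
        // Step 1: sequential disk reads, so the disk sees one reader at a time.
        List<byte[]> compressed = new ArrayList<>();
        for (Path file : files) {
            compressed.add(Files.readAllBytes(file));
        }

        // Step 2: decompress from memory in parallel; no disk I/O involved.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] bytes : compressed) {
                futures.add(pool.submit(() -> gunzip(bytes)));
            }
            List<byte[]> result = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                result.add(future.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    static byte[] gunzip(byte[] bytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```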
You can't parallelize the standard GZIPInputStream; it is single-threaded. But you can pipeline decompression and the processing of the decompressed stream into different threads, i.e. set up the GZIPInputStream as a producer and whatever processes its output as a consumer, and connect them with a bounded blocking queue.