Is it safe to call ICSharpCode.SharpZipLib in parallel on multiple threads?
We currently use the GZipOutputStream class from the ICSharpCode.SharpZipLib library for compression. We do it from a single thread.
I want to split my input data stream into chunks and compress them in parallel.
I'm worried, though, that this library may have some static state inside that would be overwritten by multiple threads and therefore corrupt the resulting stream.
Any thoughts will be appreciated.
Comments (3)
This is a really interesting question. Compression is highly CPU intensive, relying on lots of searching and comparisons. So it's very appropriate to want to parallelize it, when you've got multiple CPUs with unimpeded memory access.
There is a class called ParallelDeflateOutputStream within the DotNetZip library that does what you are describing; it is covered in the DotNetZip documentation.
It can be used only for compression, not decompression. Also, it is strictly an output stream: you cannot read from it in order to compress. Given these constraints, it is basically a DeflateOutputStream that internally uses multiple threads.
The way it works: it breaks up the incoming stream into chunks, then hands each chunk to a separate worker thread to be compressed individually. Then it merges all those compressed streams back into one ordered stream at the end.
Suppose the "chunk" size maintained by the stream is N bytes. As the caller invokes Write(), data is buffered into a bucket, or chunk. Inside the Stream.Write() method, when the first bucket is full, it calls ThreadPool.QueueUserWorkItem (QUWI), allocating the bucket to the work item. Subsequent writes into the stream begin filling the next bucket, and when that is full, Stream.Write() calls QUWI again. Each worker thread compresses its bucket using a "Flush Type" of Sync (see the DEFLATE spec), and then marks its compressed blob as ready for output. These various outputs are then re-ordered (because chunk n does not necessarily get compressed before chunk n+1) and written to the captive output stream. As each bucket is written, it is marked empty, ready to be refilled by the next Stream.Write(). Each chunk must be compressed with the flush type of Sync so that the chunks can be recombined by simple concatenation and the combined bytestream is still a legal DEFLATE stream. The final chunk needs Flush type = Finish.
The design of this stream means that callers don't need to write with multiple threads. Callers just create the stream as normal, like the vanilla DeflateStream used for output, and write into it. The stream object uses multiple threads, but your code doesn't interface directly with them. The code for a "user" of the ParallelDeflateOutputStream therefore looks like ordinary stream code: create the stream over an output stream, write into it, and close it.
It was designed for use within the DotNetZip ZipFile class, but it is quite usable as a standalone compressing output stream. The resulting stream can be de-DEFLATED (inflated?) with any inflater. The result is fully compliant with the spec.
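The chunk, sync-flush, and concatenate scheme described above can be sketched in a few lines. This is not DotNetZip's code: it is a minimal Python illustration using the standard zlib module, and the names compress_chunk and parallel_deflate, the chunk size, and the worker count are all assumptions made for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunk(chunk: bytes, is_last: bool, level: int = 6) -> bytes:
    # wbits=-15 selects a raw DEFLATE stream (no zlib or gzip wrapper).
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    body = compressor.compress(chunk)
    # Z_SYNC_FLUSH ends the chunk on a byte boundary without closing the
    # stream; only the final chunk uses Z_FINISH to mark end-of-stream.
    mode = zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH
    return body + compressor.flush(mode)


def parallel_deflate(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    # Split the input into fixed-size chunks; keep one empty chunk for empty input.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    last = len(chunks) - 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the output is ordered
        # even if chunk n+1 finishes compressing before chunk n.
        parts = pool.map(
            lambda item: compress_chunk(item[1], item[0] == last),
            enumerate(chunks),
        )
        return b"".join(parts)
```

Because every chunk gets a fresh compressor, no chunk can reference data from an earlier one, which is where the small loss in compression ratio comes from. The output can be checked with any inflater, for example zlib.decompress(result, -15).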
The stream is tweakable. You can set the size of the buffers it uses, and the level of parallelism. It doesn't create buckets without bound, because for large streams (GB scale and so on) that would cause out-of-memory conditions. So there's a fixed limit to the number of buckets, and therefore to the degree of parallelism, that can be supported.
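One way to picture the bounded-bucket idea is a fixed-length queue of in-flight chunks: when every slot is busy, the writer waits for the oldest chunk, writes it out, and reuses the slot. The following is a hedged Python sketch of that idea, not the DotNetZip implementation; deflate_stream, chunk_size, and max_in_flight are hypothetical names chosen for the example.

```python
import io
import zlib
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def deflate_stream(source, sink, chunk_size: int = 1 << 16, max_in_flight: int = 4) -> None:
    def work(chunk: bytes) -> bytes:
        # Independent raw-DEFLATE compressor per chunk, ended with a sync flush.
        c = zlib.compressobj(6, zlib.DEFLATED, -15)
        return c.compress(chunk) + c.flush(zlib.Z_SYNC_FLUSH)

    pending = deque()  # futures in submission order
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            if len(pending) == max_in_flight:
                # Every bucket is busy: wait for the oldest, write it, free the slot.
                sink.write(pending.popleft().result())
            pending.append(pool.submit(work, chunk))
        while pending:
            sink.write(pending.popleft().result())
    # Terminate the stream with an empty final block.
    sink.write(zlib.compressobj(6, zlib.DEFLATED, -15).flush(zlib.Z_FINISH))


source = io.BytesIO(b"bounded buckets keep memory flat\n" * 20000)
sink = io.BytesIO()
deflate_stream(source, sink)
```

Memory use is capped at roughly max_in_flight times chunk_size for the input buffers, plus the compressed results, no matter how large the input stream is.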
On my dual-core machine, this stream class nearly doubled the compression speed of large (100 MB and larger) files, when compared to the standard DeflateStream. I don't have any larger multi-core machines, so I couldn't test it further. The tradeoff is that the parallel implementation uses more CPU and more memory, and also compresses slightly less efficiently (1% less for large files) because of the sync framing I described above. The performance advantage will vary depending on the I/O throughput on your output stream, and whether the storage can keep up with the parallel compressor threads.
Caveat:
It is a DEFLATE stream, not GZIP. For the differences, read RFC 1951 (DEFLATE) and RFC 1952 (GZIP).
But if you really NEED gzip, the source for this stream is available, so you can view it and maybe get some ideas for yourself. GZIP is really just a wrapper on top of DEFLATE, with some additional metadata (like a CRC-32 checksum, and so on; see the spec). It seems to me that it would not be very difficult to build a ParallelGzipOutputStream, but it may not be trivial, either.
The trickiest part for me was getting the semantics of Flush() and Close() to work properly.
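To make the "wrapper on top of DEFLATE" point concrete, here is a small Python sketch that frames a raw DEFLATE stream as a gzip member following the RFC 1952 layout: a 10-byte header, the DEFLATE data, then a CRC-32 of the uncompressed data and its length modulo 2^32. The helper names raw_deflate and gzip_wrap are made up for this example, and the header fields chosen (no optional fields, zero timestamp, unknown OS) are assumptions.

```python
import gzip
import struct
import zlib


def raw_deflate(data: bytes) -> bytes:
    # Raw DEFLATE (wbits=-15); this is the part a parallel compressor would produce.
    c = zlib.compressobj(6, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()


def gzip_wrap(deflated: bytes, crc: int, size: int) -> bytes:
    # Header: magic 1f 8b, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255 (unknown).
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, 0, 0, 255)
    # Trailer: CRC-32 and length of the *uncompressed* data, little-endian.
    trailer = struct.pack("<II", crc & 0xFFFFFFFF, size & 0xFFFFFFFF)
    return header + deflated + trailer


payload = b"gzip is framed deflate\n" * 1000
member = gzip_wrap(raw_deflate(payload), zlib.crc32(payload), len(payload))
```

In a streaming or parallel implementation the checksum would have to be accumulated across chunks in input order; zlib.crc32 accepts a running value as its second argument for that purpose.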
EDIT
Just for fun, I built a ParallelGZipOutputStream that basically does what I described above, for GZip. It uses .NET 4.0's Tasks in lieu of QUWI to handle the parallel compression. I tested it just now on a 100 MB text file generated via a Markov chain engine, and compared the results of that class against some other options.
Conclusions:
The GZipStream that's built into .NET is pretty fast. It's also not very efficient, and it is not tunable.
The "BestSpeed" on the vanilla (non-parallelized) GZipStream in DotNetZip is about 20% faster than the .NET builtin stream, and gives about the same compression.
Using multiple Tasks for compression can cut about 45% off the time required on my dual-core laptop (3 GB RAM), comparing the vanilla DotNetZip GZipStream to the parallel one. I suppose the time savings would be higher for machines with more cores.
There is a cost to parallel GZIP - the framing increases the size of the compressed file by about 4%. This won't change with the number of cores used.
The resulting .gz file can be decompressed by any GZIP tool.
My understanding is that the zip is writing to (or reading from) a single underlying stream, so my answer would be a resounding no: this cannot be thread-safe if you are talking about a single underlying stream.
However, separate instances talking to separate underlying streams should be fine, and indeed it is usually easier to run separate (unrelated) tasks in parallel than it is to parallelise a single task.
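The "separate instances, separate underlying streams" approach can be illustrated as follows. This is a Python sketch rather than SharpZipLib code: each chunk is compressed by its own compressor into its own in-memory buffer, and the finished gzip members are concatenated in order. It relies on the RFC 1952 rule that a gzip file may consist of several members; the function name and parameters are assumptions for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def gzip_members_in_parallel(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # gzip.compress creates a fresh compressor and a fresh output buffer on
        # every call, so nothing is shared between the worker threads.
        members = list(pool.map(gzip.compress, chunks))
    # Concatenated members form one multi-member gzip file, in input order.
    return b"".join(members)
```

The only shared step is the final concatenation, which happens on one thread after all workers have finished. Readers that follow the multi-member rule, such as Python's gzip.decompress, return the original data.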
It is standard practice to make sure all static members are thread-safe when coding classes, so I think it is very unlikely that you would have a problem due to that issue. Of course, if you plan on using the same GZipOutputStream from different threads, that would definitely be problematic, since instance members of that class are not thread-safe.
What you might be able to do is create a thread-safe middleman Stream class (think decorator pattern) and pass that to the GZipOutputStream. This custom stream class, call it ThreadSafeStream, would itself accept a Stream instance and would use the appropriate mechanisms to synchronize access to it.
You would create one GZipOutputStream instance for each thread, and they would all share the same ThreadSafeStream wrapper instance. I suspect there will probably be a lot of bottlenecking in the ThreadSafeStream methods, but you should be able to gain some parallelism from this.
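A rough Python sketch of the decorator idea follows. ThreadSafeStream and compress_records are hypothetical names, and the sketch makes one assumption the text above does not spell out: a lock only makes individual writes atomic, so each worker here hands over a complete gzip member in a single write. If compressors wrote partial output through the wrapper, bytes from different members could interleave.

```python
import gzip
import io
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadSafeStream:
    """Hypothetical decorator that serializes access to a wrapped binary stream."""

    def __init__(self, inner):
        self._inner = inner
        self._lock = threading.Lock()

    def write(self, data: bytes) -> int:
        # Only one thread at a time may touch the wrapped stream.
        with self._lock:
            return self._inner.write(data)

    def flush(self) -> None:
        with self._lock:
            self._inner.flush()


def compress_records(records, sink: ThreadSafeStream, workers: int = 4) -> None:
    def work(record: bytes) -> None:
        # Each call uses its own compressor; the whole member goes out in one
        # write so members from different threads cannot interleave.
        sink.write(gzip.compress(record))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, records))


target = io.BytesIO()
compress_records([b"alpha\n", b"beta\n", b"gamma\n"], ThreadSafeStream(target))
restored = gzip.decompress(target.getvalue())  # members appear in completion order
```

Note that the members land in the output in whatever order the workers finish, so this variant only suits data whose records can be reordered; preserving input order needs a reordering step like the one in the first answer.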