C#: Compressing a large number of data blocks fast and efficiently

Posted 2024-12-16 22:04:36

I have around 270k data block pairs; each pair consists of one 32KiB and one 16KiB block.

When I save them to one file I of course get a very large file, but the data is easily compressed.
After compressing the 5.48GiB file with WinRAR using strong compression, the resulting file size is 37.4MiB.

But I need random access to each individual block, so I can only compress the blocks individually.
For that I used the DeflateStream class provided by .NET, which reduced the file size to 382MiB (which I could live with).
But the speed is not good enough.

A lot of the speed loss is probably due to always creating a new MemoryStream and DeflateStream instance for each block, but they don't seem to be designed for reuse.
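For illustration, a minimal sketch of what recycling the buffer could look like, assuming a single-threaded caller and the DeflateStream(Stream, CompressionLevel, leaveOpen) overload from .NET 4.5+ (the DeflateStream wrapper is still created per block, but the MemoryStream and its backing buffer are reused):

    using System.IO;
    using System.IO.Compression;

    static class BlockDeflate
    {
        // Reused across calls; not thread-safe as written (use one instance per worker thread).
        private static readonly MemoryStream Output = new MemoryStream(64 * 1024);

        public static byte[] Compress(byte[] block)
        {
            Output.SetLength(0);    // reset the reusable buffer instead of allocating a new one
            using (var deflate = new DeflateStream(Output, CompressionLevel.Fastest, leaveOpen: true))
            {
                deflate.Write(block, 0, block.Length);
            }                       // Dispose() flushes the final deflate block into Output
            return Output.ToArray();
        }
    }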

And I guess (much?) better compression could be achieved if a "global" dictionary were used instead of having one for each block.

Is there an implementation of a compression algorithm (preferably in C#) which is suited for that task?

The following link contains the percentage with which each byte value occurs, divided into three block types (32KiB blocks only).
The first and third block types each have an occurrence of 37.5% and the second 25%.
Block type percentages

Long file short story:
Type1 consists mostly of ones.
Type2 consists mostly of zeros and ones.
Type3 consists mostly of zeros.
Values greater than 128 do not occur (yet).

The 16KiB blocks almost always consist of zeros.

Comments (3)

仙女 2024-12-23 22:04:36

If you want to try different compression you can start with RLE, which should be suitable for your data - http://en.wikipedia.org/wiki/Run-length_encoding - it will be blazingly fast even in the simplest implementation. The related http://en.wikipedia.org/wiki/Category:Lossless_compression_algorithms contains more links for getting started on other algorithms if you want to roll your own or find someone's implementation.
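For illustration, a minimal run-length encoder along those lines (the format - one value byte followed by one count byte per run, runs capped at 255 - and the names are just assumptions for this sketch; decoding is the obvious reverse):

    using System.Collections.Generic;

    static class Rle
    {
        // Encodes each run as a (value, count) byte pair; counts are capped at 255 to fit in one byte.
        public static byte[] Encode(byte[] input)
        {
            var output = new List<byte>(input.Length / 4);
            int i = 0;
            while (i < input.Length)
            {
                byte value = input[i];
                int run = 1;
                while (i + run < input.Length && input[i + run] == value && run < 255)
                    run++;
                output.Add(value);
                output.Add((byte)run);
                i += run;
            }
            return output.ToArray();
        }
    }

On a mostly-zero 16KiB block this collapses the data to a few hundred bytes at most, and both encoding and decoding are a single linear pass.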

Random comment: "...A lot of the speed loss is probably..." is not the way to solve a performance problem. Measure it and see if that is really the case.
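For example, a rough harness for that measurement could look like this (the block list and the compress delegate are placeholders for whichever variants are being compared):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;

    static class CompressionBenchmark
    {
        // Runs one compression variant over all blocks and reports the wall-clock time.
        public static void Run(string label, Func<byte[], byte[]> compress, IReadOnlyList<byte[]> blocks)
        {
            var sw = Stopwatch.StartNew();
            foreach (var block in blocks)
                compress(block);
            sw.Stop();
            Console.WriteLine($"{label}: {blocks.Count} blocks in {sw.ElapsedMilliseconds} ms");
        }
    }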

暮年 2024-12-23 22:04:36

Gzip is known to be "fine", which means the compression ratio is okay and the speed is good.
If you want more compression, other alternatives exist, such as 7z.

If you want more speed, which seems to be your objective, a faster alternative will provide a significant speed advantage at the cost of some compression efficiency. "Significant" here means many times faster, such as 5x-10x. Such algorithms are favored for "in-memory" compression scenarios such as yours, since they make accessing a compressed block almost painless.

As an example, Clayton Stangeland just released LZ4 for C#. The source code is available under a BSD license:
https://github.com/stangelandcl/LZ4Sharp

There are some comparison metrics against gzip on the project homepage, such as:

i5 memcpy 1658 MB/s
i5 Lz4 Compression 270 MB/s Decompression 1184 MB/s  
i5 LZ4C# Compression 207 MB/s Decompression 758 MB/s 49%
i5 LZ4C# whole corpus Compression 267 MB/s Decompression 838 MB/s Ratio 47%
i5 gzip whole corpus Compression 48 MB/s Decompression 266 MB/s Ratio 33%
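
As a sketch of how such a swap could be kept cheap (the interface and class names below are our own, not LZ4Sharp's API), the per-block codec can sit behind a small interface, with the built-in DeflateStream as the baseline implementation, so a faster library can be dropped in later without touching the callers:

    using System.IO;
    using System.IO.Compression;

    // Hypothetical seam: any per-block codec (DeflateStream today, an LZ4 binding tomorrow).
    public interface IBlockCompressor
    {
        byte[] Compress(byte[] block);
        byte[] Decompress(byte[] compressed, int originalLength);
    }

    public sealed class DeflateBlockCompressor : IBlockCompressor
    {
        public byte[] Compress(byte[] block)
        {
            using var output = new MemoryStream();
            using (var deflate = new DeflateStream(output, CompressionLevel.Fastest, leaveOpen: true))
                deflate.Write(block, 0, block.Length);
            return output.ToArray();
        }

        public byte[] Decompress(byte[] compressed, int originalLength)
        {
            var result = new byte[originalLength];
            using var input = new MemoryStream(compressed);
            using var deflate = new DeflateStream(input, CompressionMode.Decompress);
            int total = 0, read;
            while (total < originalLength && (read = deflate.Read(result, total, originalLength - total)) > 0)
                total += read;
            return result;
        }
    }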

Hope this helps.

¢蛋碎的人ぎ生 2024-12-23 22:04:36

You can't have random access into a Deflate stream, no matter how hard you try (unless you forfeit the LZ77 part, but that's what's mostly responsible for making your compression ratio so high right now -- and even then, there are tricky issues to circumvent). This is because one part of the compressed data is allowed to refer to a previous part up to 32K bytes back, which may in turn refer to another part, and so on, and you end up having to decode the stream from the beginning to get the data you want, even if you know exactly where it is in the compressed stream (which, currently, you don't).

But what you could do is compress many (but not all) blocks together using one stream. Then you'd get fairly good speed and compression, but you wouldn't have to decompress all the blocks to get at the one you want; just the particular chunk your block happens to be in. You'd need an additional index that tracks where each compressed chunk of blocks starts in the file, but that's fairly low overhead. Think of it as a compromise between compressing everything together (which is great for compression but sucks for random access) and compressing each block individually (which is great for random access but sucks for compression and speed).
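A sketch of that scheme, under assumptions that are ours rather than the answer's (each pair stored as one 48KiB buffer, an arbitrary 256 pairs per chunk, DeflateStream as the codec, and a simple in-memory list as the index):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;

    static class ChunkedBlockFile
    {
        const int PairSize = 48 * 1024;     // one 32KiB block followed by its 16KiB block
        const int PairsPerChunk = 256;      // tuning knob: bigger chunks = better ratio, slower random access

        // Writes the pairs in compressed chunks and returns the (offset, length) of each chunk.
        public static List<(long Offset, int Length)> Write(string path, IReadOnlyList<byte[]> pairs)
        {
            var index = new List<(long Offset, int Length)>();
            using var file = File.Create(path);
            for (int start = 0; start < pairs.Count; start += PairsPerChunk)
            {
                long offset = file.Position;
                using (var deflate = new DeflateStream(file, CompressionLevel.Fastest, leaveOpen: true))
                {
                    int end = Math.Min(start + PairsPerChunk, pairs.Count);
                    for (int i = start; i < end; i++)
                        deflate.Write(pairs[i], 0, pairs[i].Length);
                }   // disposing flushes this chunk's final deflate block
                index.Add((offset, (int)(file.Position - offset)));
            }
            return index;   // persist this separately, e.g. as a small header or side file
        }

        // Reads one pair back: decompress only its chunk, skipping earlier pairs inside that chunk.
        public static byte[] ReadPair(string path, List<(long Offset, int Length)> index, int pairIndex)
        {
            var (offset, _) = index[pairIndex / PairsPerChunk];
            using var file = File.OpenRead(path);
            file.Position = offset;
            using var deflate = new DeflateStream(file, CompressionMode.Decompress);
            var buffer = new byte[PairSize];
            for (int skip = pairIndex % PairsPerChunk; skip >= 0; skip--)
            {
                int total = 0, read;
                while (total < PairSize && (read = deflate.Read(buffer, total, PairSize - total)) > 0)
                    total += read;
            }
            return buffer;  // the requested 32KiB + 16KiB pair
        }
    }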
