Split gzipped logfiles without storing the ungzipped splits on disk

Posted on 2024-09-28 03:31:37

I have a recurring task of splitting a set of large (about 1-2 GiB each) gzipped Apache logfiles into several parts (say chunks of 500K lines). The final files should be gzipped again to limit the disk usage.

On Linux I would normally do:

zcat biglogfile.gz | split -l500000

The resulting files will be named xaa, xab, xac, etc., so I then do:

gzip x*

The effect of this method is that as an intermediate result these huge files are temporarily stored on disk. Is there a way to avoid this intermediate disk usage?

Can I (in a way similar to what xargs does) have split pipe the output through a command (like gzip) and recompress the output on the fly?
Or am I looking in the wrong direction and is there a much better way to do this?

Thanks.

笙痞 2024-10-05 03:31:37

You can use split's --filter option, as explained in the manual, e.g.:

zcat biglogfile.gz | split -l500000 --filter='gzip > $FILE.gz'

Edit: I am not aware of when the --filter option was introduced, but according to the comments it does not work in coreutils 8.4.
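To see what the filter does, here is a minimal sketch on a tiny stand-in file. The directory name filter_demo, the prefix part_ and the 12-line input are assumptions made up for the demo. It assumes GNU split; as far as I know --filter arrived in coreutils 8.13, which would explain the 8.4 report above.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory for the demo.
rm -rf filter_demo && mkdir filter_demo

# Small stand-in for biglogfile.gz: 12 numbered lines.
seq 1 12 | gzip > filter_demo/biglogfile.gz

# split starts the filter once per chunk and sets $FILE to the chunk name,
# so only compressed data is written to disk.
# The single quotes stop the calling shell from expanding $FILE too early.
zcat filter_demo/biglogfile.gz |
    split -l 5 --filter='gzip > "$FILE.gz"' - filter_demo/part_

ls filter_demo
```

The trailing "-" tells split to read standard input, and the argument after it is the output prefix that replaces the default x.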

对不⑦ 2024-10-05 03:31:37

A script like the following might suffice.

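#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;    # CPAN module that provides the :gzip I/O layer

my $filename = 'out';     # prefix of the output files
my $limit    = 500000;    # lines per output file

my $fileno = 1;
my $line   = 0;
my $fh;

# Reads the decompressed log from standard input (or from files named as arguments).
while (<>) {
    # Open a new compressed output file at the start and after every $limit lines.
    if (!$fh || $line >= $limit) {
        close $fh if $fh;
        # The braces matter: "$filename_$fileno" would look up the variable $filename_.
        open $fh, '>:gzip', "${filename}_$fileno"
            or die "Cannot open ${filename}_$fileno: $!";
        $fileno++;
        $line = 0;
    }
    print $fh $_;
    $line++;
}
close $fh if $fh;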
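Where neither split --filter nor PerlIO::gzip is available, the same rotate-every-N-lines idea can be sketched with awk piping each chunk into its own gzip process. This swaps awk in for Perl and is not the script above; the directory awk_demo, the prefix out_ and the tiny limit of 5 are assumptions for the demo.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory and a 12-line stand-in for the real log.
rm -rf awk_demo && mkdir awk_demo
seq 1 12 | gzip > awk_demo/biglogfile.gz

zcat awk_demo/biglogfile.gz | awk -v limit=5 -v prefix=awk_demo/out_ '
    # On the first line of each chunk, finish the previous gzip and name the next one.
    (NR - 1) % limit == 0 {
        if (cmd != "") close(cmd)
        cmd = sprintf("gzip > %s%d.gz", prefix, ++n)
    }
    # Send every line to the gzip process of the current chunk.
    { print | cmd }
    END { if (cmd != "") close(cmd) }
'

ls awk_demo
```

Calling close(cmd) before starting the next chunk keeps only one gzip process open at a time and makes awk wait until that chunk is fully written.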

败给现实 2024-10-05 03:31:37

In case people need to keep the first row (the header) in each of the pieces:

zcat bigfile.csv.gz | tail -n +2 | split -l1000000 --filter='{ { zcat bigfile.csv.gz | head -n 1 | gzip; gzip; } > $FILE.gz; };'

I know it's a bit clunky. I'm looking for a more elegant solution.
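One possibly tidier variant, offered as a sketch rather than a tested production recipe: read the header once into an exported variable and let the filter print it ahead of each chunk, so the input is not decompressed again for every piece. The variable name HEADER, the directory header_demo and the 7-row sample are assumptions for the demo.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory and a tiny CSV stand-in: one header plus 7 rows.
rm -rf header_demo && mkdir header_demo
{ echo "id"; seq 1 7; } | gzip > header_demo/bigfile.csv.gz

# Read the header once and export it so the shell started by split can see it.
HEADER=$(zcat header_demo/bigfile.csv.gz | head -n 1)
export HEADER

# Each chunk becomes: header line, then the chunk rows, all through one gzip.
zcat header_demo/bigfile.csv.gz | tail -n +2 |
    split -l 3 --filter='{ printf "%s\n" "$HEADER"; cat; } | gzip > "$FILE.gz"' - header_demo/part_

zcat header_demo/part_aa.gz
```

This relies on split running the filter through a shell that inherits the environment, which is why HEADER is exported.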

狼性发作 2024-10-05 03:31:37

There is zipsplit, but it uses the zip algorithm rather than the gzip algorithm.
