Splitting a large compressed file into multiple outputs using AWK and BASH

Posted 2024-11-25 13:16:05

I have a large (3GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files - if field one is john_smith, I want the string to be placed in john_smith.gz. NOTE: the string field can and does contain special characters.

I can do this easily in a for loop over the domains using BASH, but I'd much prefer the efficiency of reading the file in once using AWK.

I have tried using the system function within awk with escaped single quotes around the string

zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'

and it works perfectly on most of the lines; however, some of them are printed to STDERR with an error that the shell cannot execute a command (the shell treats part of the string as a command). It looks like special characters might be breaking it.
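
For illustration, here is a minimal reproduction of that failure mode using a made-up line whose STRING field contains a single quote (the exact error text depends on the shell that runs the command):

printf "jane_doe\tit's a test\n" \
| awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'
# sh: Syntax error: Unterminated quoted string   (or similar)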

Any thoughts on how to fix this, or any alternate implementations that would help?

Thanks!

-Sean


4 Answers

伪装你 2024-12-02 13:16:05

You're facing a big trade-off between time and disk space.
I assume you're trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.

In any case, your time is more valuable than 3 GB of disk space. Why not try

 zcat large_file.gz \
 | awk -F'\t' '{
    name=$1; string=$2; outFile=name".txt"
    print name "\t" string >> outFile
    # close( outFile) 
   }'

 echo *.txt | xargs gzip -9

You may need to uncomment the #close(outFile).
The xargs is included because I'm assuming you're going to have more than 1000 filenames created. Even if you don't, it won't hurt to use that technique.

Note this code assumes tab-delimited data; change the -F argument as needed, and the "\t" in the print statement, to give the field separator you need.
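
In case it helps, the close-per-line variant could look like this (same tab-delimited assumption); awk's ">>" reopens the file in append mode, so closing after every print keeps the number of simultaneously open files small at the cost of extra open/close calls:

 zcat large_file.gz \
 | awk -F'\t' '{
    outFile = $1 ".txt"
    print $1 "\t" $2 >> outFile   # ">>" appends, so nothing is lost on reopen
    close(outFile)
   }'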

Don't have time to test this. If you like this idea and get stuck, please post small sample data, expected output, and error messages that you're getting.

I hope this helps.

仅此而已 2024-12-02 13:16:05

This little Perl script does the job nicely:

  • keeping all destination files open for performance
  • doing elementary error handling
  • Edit: now also pipes output through gzip on the fly

There is a bit of a kludge with $fh, because apparently using the hash entry directly as the filehandle in print doesn't work.

#!/usr/bin/perl
use strict;
use warnings;

my $suffix = ".txt.gz";

my %pipes;
while (my $record = <>)
{
    my ($id, $line) = split /\t/, $record, 2;
    next unless defined $line;          # skip lines without a tab separator

    exists $pipes{$id} 
        or open ($pipes{$id}, "|gzip -9 > '$id$suffix'") 
        or die "can't open/create $id$suffix, or cannot spawn gzip";

    my $fh = $pipes{$id};
    print $fh $line;
}

print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";

Oh, use it like

zcat input.gz | ./myscript.pl
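
A quick sanity check afterwards (assuming the data really contained a john_smith, as in the question):

zcat john_smith.txt.gz | head    # should show only john_smith's strings
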
黯然 2024-12-02 13:16:05

Create this program as, say, largesplitter.c and use the command

zcat large_file.gz | largesplitter

The unadorned program is:

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
        char    buf [32000];  // todo: resize this if the second field can be larger than this buffer
        char    cmd [120];
        long    linenum = 0;
        while (fgets (buf, sizeof buf, stdin))
        {
                ++linenum;
                char *cp = strchr (buf, '\t');   // identify first field delimited by tab
                if (!cp)
                {
                        fprintf (stderr, "line %d missing delimiter\n", linenum);
                        continue;
                }
                *cp = '\000';  // split line
                FILE *out = fopen (buf, "w");
                if (!out)
                {
                        fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
                        continue;
                }
                fprintf (out, "%s", cp+1);
                fclose (out);
                snprintf (cmd, sizeof cmd, "gzip %s", buf);
                system (cmd);
        }
        return 0;
}

This compiles without error on my system, but I have not tested its functionality.
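
One possible way to build and run it (assuming gcc, though any C compiler should do):

gcc -O2 -o largesplitter largesplitter.c
zcat large_file.gz | ./largesplitter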

酷到爆炸 2024-12-02 13:16:05

Maybe try something along the lines of:

zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")

I haven't tried it myself, as I don't have any large files to play with.
