Splitting a large compressed file into multiple outputs using AWK and BASH

Posted 2024-11-25 13:16:05

I have a large (3GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files - if field one is john_smith, I want the string to be placed in john_smith.gz. NOTE: the string field can and does contain special characters.

I can do this easily in a for loop over the domains using BASH, but I'd much prefer the efficiency of reading the file in once using AWK.

I have tried using the system function within awk with escaped single quotes around the string

zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'

and it works perfectly on most of the lines; however, some of them are printed to STDERR with an error that the shell cannot execute a command (the shell treats part of the string as a command). It looks like special characters might be breaking it.
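
For illustration, here is a minimal reproduction of that failure mode using a made-up line whose STRING field contains a single quote (the exact error text depends on the shell that runs the command):

printf "jane_doe\tit's a test\n" \
| awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'
# sh: Syntax error: Unterminated quoted string   (or similar)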

Any thoughts on how to fix this, or any alternate implementations that would help?

Thanks!

-Sean


4 Answers

伪装你 2024-12-02 13:16:05

You're facing a big trade-off between time and disk space.
I assume you're trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.

In any case, your time is more valuable than 3 GB of disk space. Why not try

 zcat large_file.gz \
 | awk -F'\t' '{
    name=$1; string=$2; outFile=name".txt"
    print name "\t" string >> outFile
    # close( outFile) 
   }'

 echo *.txt | xargs gzip -9

You may need to uncomment the #close(outFile).
The xargs is included because I'm assuming you're going to have more than 1000 filenames created. Even if you don't, it won't hurt to use that technique.

Note this code assumes tab-delimited data; change the -F argument as needed, and the "\t" in the print statement, to give the field separator you need.
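
In case it helps, the close-per-line variant could look like this (same tab-delimited assumption); awk's ">>" reopens the file in append mode, so closing after every print keeps the number of simultaneously open files small at the cost of extra open/close calls:

 zcat large_file.gz \
 | awk -F'\t' '{
    outFile = $1 ".txt"
    print $1 "\t" $2 >> outFile   # ">>" appends, so nothing is lost on reopen
    close(outFile)
   }'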

Don't have time to test this. If you like this idea and get stuck, please post small sample data, expected output, and error messages that you're getting.

I hope this helps.

仅此而已 2024-12-02 13:16:05

This little Perl script does the job nicely:

  • keeping all destination files open for performance
  • doing elementary error handling
  • Edit: now also pipes output through gzip on the fly

There is a bit of a kludge with $fh, because apparently using the hash entry directly as the filehandle in print doesn't work.

#!/usr/bin/perl
use strict;
use warnings;

my $suffix = ".txt.gz";

my %pipes;
while (my $record = <>)
{
    my ($id, $line) = split /\t/, $record, 2;
    next unless defined $line;          # skip lines without a tab separator

    exists $pipes{$id} 
        or open ($pipes{$id}, "|gzip -9 > '$id$suffix'") 
        or die "can't open/create $id$suffix, or cannot spawn gzip";

    my $fh = $pipes{$id};
    print $fh $line;
}

print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";

Oh, use it like

zcat input.gz | ./myscript.pl
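
A quick sanity check afterwards (assuming the data really contained a john_smith, as in the question):

zcat john_smith.txt.gz | head    # should show only john_smith's strings
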
黯然 2024-12-02 13:16:05

Create this program as, say, largesplitter.c and use the command

zcat large_file.gz | largesplitter

The unadorned program is:

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
        char    buf [32000];  // todo: resize this if the second field can be larger than this buffer
        char    cmd [120];
        long    linenum = 0;
        while (fgets (buf, sizeof buf, stdin))
        {
                ++linenum;
                char *cp = strchr (buf, '\t');   // identify first field delimited by tab
                if (!cp)
                {
                        fprintf (stderr, "line %d missing delimiter\n", linenum);
                        continue;
                }
                *cp = '\000';  // split line
                FILE *out = fopen (buf, "w");
                if (!out)
                {
                        fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
                        continue;
                }
                fprintf (out, "%s", cp+1);
                fclose (out);
                snprintf (cmd, sizeof cmd, "gzip %s", buf);
                system (cmd);
        }
        return 0;
}

This compiles without error on my system, but I have not tested its functionality.
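
One possible way to build and run it (assuming gcc, though any C compiler should do):

gcc -O2 -o largesplitter largesplitter.c
zcat large_file.gz | ./largesplitter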

酷到爆炸 2024-12-02 13:16:05

Maybe try something along the lines of:

zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")

I haven't tried it myself, as I don't have any large files to play with.
