Splitting a large compressed file into multiple outputs with AWK and BASH
I have a large (3GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files - if field one is john_smith, I want the string to be placed in john_smith.gz. NOTE: the string field can and does contain special characters.
I can do this easily in a for loop over the domains using BASH, but I'd much prefer the efficiency of reading the file in once using AWK.
I have tried using the system function within awk with escaped single quotes around the string
zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'
and it works perfectly on most of the lines; however, some of them produce errors on STDERR saying the shell cannot execute a command (the shell thinks part of the string is a command). It looks like special characters might be breaking it.
Any thoughts on how to fix this, or any alternate implementations that would help?
Thanks!
-Sean
4 Answers
You're facing a big trade-off between time and disk space.
I assume you're trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.
In any case, your time is more valuable than 3 GB of disk space. Why not try something along the lines of the sketch below?
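A minimal sketch of that approach, assuming tab-separated NAME and STRING fields; the .txt intermediate files and the trailing gzip pass are my reading of the suggestion, not the original snippet:

zcat large_file.gz | awk -F '\t' '{
  outFile = $1 ".txt"            # one plain-text output file per NAME
  print $1 "\t" $2 >> outFile
  #close(outFile)                # uncomment if you run out of file descriptors
}'
ls *.txt | xargs gzip            # compress all the per-NAME files in a second pass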
You may need to uncomment the #close(outFile).
The xargs is included because I'm assuming you're going to have more than 1000 filenames created. Even if you don't, it won't hurt to use that technique.
Note this code assumes tab-delimited data; change the argument to -F as needed, and change the "\t" in the print statement to give the field separator you need.
Don't have time to test this. If you like this idea and get stuck, please post small sample data, expected output, and error messages that you're getting.
I hope this helps.
This little perl script does the job nicely, compressing each output through gzip on the fly. There is a bit of a kludge with $fh, because apparently using the hash entry directly doesn't work.
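A minimal sketch of such a script (the file name splitter.pl, the tab separator, and the "gzip >>" append pipe are assumptions; only the one-pipe-per-NAME idea and the $fh kludge come from the answer itself):

#!/usr/bin/perl
use strict;
use warnings;

my %pipes;                              # NAME => handle of an open gzip pipe

while (my $line = <STDIN>) {
    chomp $line;
    my ($name, $string) = split /\t/, $line, 2;
    next unless defined $string;        # skip malformed lines

    # open one "gzip >> NAME.gz" pipe per NAME, on first use only
    # (assumes NAME contains no shell metacharacters)
    unless (exists $pipes{$name}) {
        open($pipes{$name}, '|-', "gzip >> '$name.gz'")
            or die "cannot spawn gzip for $name: $!";
    }

    # the kludge: copy the hash entry into a plain scalar before printing
    my $fh = $pipes{$name};
    print $fh "$name\t$string\n";
}

close $_ for values %pipes;             # flush and finish every gzip stream

Oh, use it like:

zcat large_file.gz | perl splitter.pl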
Create this program as, say, largesplitter.c, compile it, and run the zcat output through it. The unadorned program is below.
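A sketch of the command line and of what such a program could look like (the .txt output suffix, the fixed 1 MiB line buffer, and the open-append-close-per-record strategy are assumptions, not the answer's actual source):

cc -o largesplitter largesplitter.c
zcat large_file.gz | ./largesplitter

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read "NAME\tSTRING" lines on stdin and append each line to NAME.txt.
   Opening and closing the file per record is slow but never exhausts
   file descriptors, however many distinct NAMEs there are. */
int main(void)
{
    static char line[1048576];          /* assumes no input line exceeds 1 MiB */

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        char *tab = strchr(line, '\t');
        if (tab == NULL)
            continue;                   /* skip malformed lines */
        *tab = '\0';                    /* line now holds NAME; tab + 1 holds STRING */

        char fname[4096];
        snprintf(fname, sizeof(fname), "%s.txt", line);

        FILE *out = fopen(fname, "a");
        if (out == NULL)
        {
            fprintf(stderr, "largesplitter: cannot open %s\n", fname);
            return EXIT_FAILURE;
        }
        fprintf(out, "%s\t%s", line, tab + 1);   /* STRING still ends with '\n' */
        fclose(out);
    }
    return EXIT_SUCCESS;
}

Opening and closing the output file for every record keeps the descriptor count at one, at the cost of speed; caching one FILE * per NAME (and gzipping the .txt files afterwards) would be the obvious refinement.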
This compiles without error on my system, but I have not tested its functionality.
Maybe try something along the lines of:
zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")
I haven't tried it myself, as I don't have any large files to play with.