Insert header into file
I would like to hear your suggestions on how to insert header lines (all the lines of one file) at the top of another, much bigger file (several GB). I would prefer a Unix/awk/sed way of doing the job.
# The header I need to insert into the other file; it is in a file named "header".
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO
Anything that inserts data at the front of a file has to rewrite the whole file, so the job comes down to using cat to write the header and then the big file to a temporary file, and moving that temporary file over the original. You might prefer to locate the temporary file on the same file system as the file you are editing. If you are going to be doing this all day, every day, you might assemble something a little slicker, but the chances are the savings will be minuscule (fractions of a second per file).
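The temporary-file approach described above can be sketched as a small shell function. This is a minimal sketch, not the answer's original code: the function name prepend_header is made up for illustration, and it assumes a mktemp that accepts a template path (as the GNU and BSD versions do).

```shell
#!/bin/sh
# prepend_header HEADER FILE...
# Hypothetical helper: writes HEADER followed by each FILE to a temporary
# file, then renames the temporary file over FILE.
prepend_header() {
    header=$1
    shift
    for file in "$@"; do
        # Creating the temporary file next to the target keeps the final mv
        # on one file system, so it is a rename and not a second full copy.
        tmp=$(mktemp "$(dirname "$file")/prepend.XXXXXX") || return 1
        if cat "$header" "$file" > "$tmp"; then
            mv "$tmp" "$file"
        else
            # Leave the original untouched if the copy failed (e.g. disk full).
            rm -f "$tmp"
            return 1
        fi
    done
}
```

It would be called as prepend_header header big1.vcf big2.vcf, where the file names are placeholders. The replaced file takes on the temporary file's permissions and ownership, so those may need restoring afterwards.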
If you really, really must use sed, then I suppose you could have it read the contents of header 'after' line 0 (that is, before line 1) and pass everything else through unchanged. This isn't as swift as cat, though.

An analogous construct using awk simply prints each input line, header first, on the output; again, not as swift as cat.

One more advantage of cat over sed or awk: cat will work even if the big files are mainly binary data (it is oblivious to the content of the files). Both sed and awk are designed to handle data split into lines; while modern versions will probably handle even binary data fairly well, it is not what they are designed for.
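For comparison, the sed and awk variants discussed in the first answer might look like the following sketch. The function names are hypothetical. Rather than relying on line address 0, which some sed implementations may reject when it is used on its own (GNU sed, for instance, documents it only in the 0,/regex/ form), the sed version here simply prints both files in order; that is a stand-in chosen for portability, not the answer's own command.

```shell
#!/bin/sh
# Hypothetical helpers: each writes HEADER ($1) followed by FILE ($2)
# to standard output, one input line at a time.

# With -n and the p command, sed prints every line of each input file in order.
header_then_file_sed() {
    sed -n p "$1" "$2"
}

# awk with the single action { print } copies each input line to the output.
header_then_file_awk() {
    awk '{ print }' "$1" "$2"
}
```

Either function would be used like cat, with the output redirected to a temporary file that is then moved over the original. Both tools parse the multi-GB file line by line, which is the extra work that makes them slower than cat.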
I did it all with a Perl script, because I had to traverse a directory tree and handle various file types differently. The basic script was
Share and enjoy! Or not -- I'm no Perl hacker, so this is probably double-plus-unoptimal Perl code. Still, it worked for me!
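The Perl script itself is not shown, so the following is only a rough shell stand-in for the same idea of walking a directory tree and treating file types differently. Everything in it is an assumption made for illustration: the function name prepend_tree, the choice of *.vcf as the only type that gets a header, and a mktemp that accepts a template path.

```shell
#!/bin/sh
# prepend_tree HEADER DIR
# Hypothetical helper: prepends HEADER to every *.vcf file under DIR and
# leaves all other file types untouched.
prepend_tree() {
    header=$1
    dir=$2
    # find hands batches of regular files to an inline sh script;
    # inside that script $1 is the header and the remaining args are files.
    find "$dir" -type f -exec sh -c '
        header=$1
        shift
        for file in "$@"; do
            case $file in
                *.vcf) ;;          # this type gets the header
                *) continue ;;     # every other type is skipped
            esac
            tmp=$(mktemp "$(dirname "$file")/prepend.XXXXXX") || exit 1
            cat "$header" "$file" > "$tmp" && mv "$tmp" "$file" || {
                rm -f "$tmp"
                exit 1
            }
        done
    ' sh "$header" {} +
}
```

More branches in the case statement would be the place to give other file types their own headers or their own handling.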