Inserting a header into a file

Posted 2024-11-07 11:47:10

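I would like advice on how to insert header lines (every line of one file) at the top of another, much larger file (several GB). I would prefer a Unix/awk/sed way of doing this.

# The header I need to insert into the other file; the lines are in a file named "header".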


##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO



自控 2024-11-14 11:47:10

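header="/name/of/file/containing/header"
tmp="/tmp/xx.$$"    # $$ is the shell's PID, which keeps the temporary name unique per run
for file in "$@"
do
    # Write header + original contents to the temporary file, then replace the original.
    cat "$header" "$file" > "$tmp" &&
    mv "$tmp" "$file"
done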

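You may prefer to create the temporary file on the same file system as the file you are editing, so that mv is a simple rename rather than a second copy of a multi-gigabyte file. Either way, anything that inserts data at the front of a file ends up working much like this. If you are going to be doing this all day, every day, you might assemble something a little slicker, but the savings will probably be minuscule (fractions of a second per file).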
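The point about keeping the temporary file on the same file system can be sketched as below. This is a minimal sketch rather than part of the original answer: the function name `prepend_header` and the `.prepend.XXXXXX` template are made up for illustration, and it assumes a `mktemp` that accepts a template path, as the GNU and BSD versions do.

```sh
#!/bin/sh
# Hypothetical helper: prepend the contents of $1 (header) to $2 (target).
prepend_header() {
    header=$1
    file=$2
    dir=$(dirname -- "$file")
    # Create the temporary file next to the target so that mv is a rename
    # within one file system instead of another full copy of the data.
    tmp=$(mktemp "$dir/.prepend.XXXXXX") || return 1
    if cat -- "$header" "$file" > "$tmp"; then
        mv -- "$tmp" "$file"
    else
        # Leave the original untouched if the copy failed (e.g. disk full).
        rm -f -- "$tmp"
        return 1
    fi
}
```

It would be called as `prepend_header header big.vcf` (the file name `big.vcf` is only an example). Note that the replaced file takes on the permissions of the temporary file, which `mktemp` creates readable and writable by the owner only.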

If you really, really must use sed, then I suppose you could use:

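header="/name/of/file/containing/header"
tmp="/tmp/xx.$$"    # $$ is the shell's PID, which keeps the temporary name unique per run
for file in "$@"
do
    # "0r file" asks sed to read the header before line 1 of the input.
    sed -e "0r $header" "$file" > "$tmp" &&
    mv "$tmp" "$file"
done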

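The command reads the contents of the header file 'after' line 0 (that is, before line 1) and passes everything else through unchanged. It is not as swift as cat, though, and not every sed accepts line address 0 with the r command (older GNU sed releases reject it), so check it on your system first.

An analogous construct using awk is: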

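header="/name/of/file/containing/header"
tmp="/tmp/xx.$$"    # $$ is the shell's PID, which keeps the temporary name unique per run
for file in "$@"
do
    # awk reads the header first, then the file, printing every line unchanged.
    awk '{print}' "$header" "$file" > "$tmp" &&
    mv "$tmp" "$file"
done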

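This simply prints each input line to the output; again, it is not as swift as cat.

One more advantage of cat over sed or awk: cat works even if the big files are mainly binary data, because it is oblivious to the content of the files. Both sed and awk are designed to handle data split into lines; modern versions will probably handle binary data fairly well, but that is not what they were designed for.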
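The claim that cat is oblivious to file content can be checked with a small experiment. The sketch below is illustrative only: the scratch directory and the file names `header.txt`, `blob.bin` and `combined.bin` are made up, and the payload is a handful of bytes rather than a multi-gigabyte file.

```sh
#!/bin/sh
# Illustrative check: prepend a text header to a file that contains NUL and
# other non-text bytes, then confirm the original bytes come through intact.
workdir=$(mktemp -d) || exit 1
printf '##fileformat=VCFv4.0\n' > "$workdir/header.txt"
# Octal escapes in the printf format write raw bytes: NUL, 0x01 and 0xFF.
printf 'abc\000\001\377xyz' > "$workdir/blob.bin"

cat "$workdir/header.txt" "$workdir/blob.bin" > "$workdir/combined.bin"

# Skip the header bytes and compare the remainder with the original payload.
header_size=$(wc -c < "$workdir/header.txt")
tail -c "+$((header_size + 1))" "$workdir/combined.bin" |
    cmp -s - "$workdir/blob.bin"
status=$?    # 0 means the payload is byte-for-byte identical

rm -rf "$workdir"
echo "cmp exit status: $status"
```

Running the same comparison with the sed or awk variant in place of cat is a quick way to find out how a particular implementation treats such data.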


番薯 2024-11-14 11:47:10

I did it all with a Perl script, because I had to traverse a directory tree and handle various file types differently. The basic script was:

#!perl -w
use strict;

process_directory(".");

# Recursively walk $dir and fix every file found along the way.
sub process_directory {
    my $dir = shift;
    opendir DIR, $dir or die "$dir: not a directory\n";
    my @files = readdir DIR;
    closedir DIR;
    foreach (@files) {
        next if (/^\./ or /bin/ or /obj/);  # ignore some directories
        if (-d "$dir/$_") {
            process_directory("$dir/$_");
        } else {
            fix_file("$dir/$_");
        }
    }
}

# Write the inserted text plus the original contents to "$file-f", rename the
# original to "<name>-old.<ext>", then move the fixed copy into its place.
sub fix_file {
    my $file = shift;
    open SRC, $file or die "Can't open $file\n";
    my $fix = "$file-f";
    open FIX, ">$fix" or die "Can't open $fix\n";
    print FIX <<EOT;
        -- Text to insert
EOT
    while (<SRC>) {
        print FIX;
    }
    close SRC;
    close FIX;
    my $oldFile = $file;
    $oldFile =~ s/(.*)\.(\w+)$/$1-old.$2/;
    if (rename $file, $oldFile) {
        rename $fix, $file;
    }
}

Share and enjoy! Or not: I'm no Perl hacker, so this is probably double-plus-unoptimal Perl code. Still, it worked for me!
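For readers who would rather stay in the shell, roughly the same directory walk can be sketched with find. This is not a translation of the Perl script, only a sketch under stated assumptions: the function name `prepend_tree` is invented, the header comes from a file rather than a here-document, entries are skipped only when they are hidden or named exactly `bin` or `obj` (the Perl patterns match those strings anywhere in a name), and no `-old` backup copy is kept.

```sh
#!/bin/sh
# Hypothetical helper: prepend header file $1 to every regular file under
# directory $2, skipping hidden entries and anything named bin or obj.
prepend_tree() {
    header=$1
    root=$2
    # -prune stops find from descending into (or acting on) the skipped names;
    # every other regular file is handed to the inline sh script in batches.
    find "$root" \( -name '.?*' -o -name bin -o -name obj \) -prune -o \
        -type f -exec sh -c '
            header=$1; shift
            for file in "$@"; do
                tmp="$file-f"    # same temporary naming as the Perl script
                cat "$header" "$file" > "$tmp" && mv "$tmp" "$file"
            done
        ' sh "$header" {} +
}
```

A call such as `prepend_tree /path/to/header .` would process the current directory tree. The header file should live outside that tree, or it will have a copy of itself prepended as well.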
