Perl - Problem splitting columns in a tab-delimited text file and replacing the columns with new values

Posted on 2024-11-19 18:49:06

I have a tab-delimited text file made up of a number of rows and columns. I want to change the contents of the first two columns, then write the amended file to a new file.
Before the change, the first two columns of each line look something like this:

COLUMN1:    dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5

COLUMN2:    dip:DIP-48957N|uniprotkb:P49281

I want them to contain just the ID at the end of each column, like this:

COLUMN1:    Q96PU5

COLUMN2:    P49281

I split each line on tabs to get the individual columns, then split the first two columns to get the required ID ($prot_id), and finally tried substituting the ID for the contents of columns 1 and 2. However, the output in the changed file is not what I expect. It looks something like this:

COLUMN1:    Q96PU5|refseq:NP_056092|uniprotkb:Q96PU5

COLUMN2:    P49281|uniprotkb:P49281

Only the first part of each column has been substituted. I have been playing around with this for hours and cannot figure out what I'm doing wrong. Any help is greatly appreciated.
My code is as follows:

#!/usr/bin/perl  

use warnings;
use strict;


my $file = 'DIP.txt';

open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt'); 
my @lines = <INFILE>;


foreach $_ (@lines) {
    my @columns = split('\t', $_);

            my $col1 = $columns[0];
            my $col2 = $columns[1];


            my @split_col1 = split ('uniprotkb:', $col1);
            my @split_col2 = split ('uniprotkb:', $col2);

            my $prot_id1 = $split_col1[length(@split_col1)];
            my $prot_id2 = $split_col2[length(@split_col2)];

            print $prot_id1, "\n";

             s/$col1/$prot_id1/;
             s/$col2/$prot_id2/;

            print {$outfile} $_; 
}



exit;

月朦胧 2024-11-26 18:49:06

There are already some decent answers, but I'd like to show you a simpler solution. You would use the script like this:

$ script.pl DIP.txt > DIP_changed.txt

And the script itself is really just:

while (<>) {
    s/\S+uniprotkb:(\S+)/$1/;
    s/\S+uniprotkb:(\S+)/$1/;
    print;
}

It doesn't need to be more complicated than that.
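
To see why running the same substitution twice is enough, here is a minimal sketch that applies it to one sample line. The helper name shorten_ids and the third column in the sample are assumptions made for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): apply the answer's
# substitution to the first two "uniprotkb:" fields of one line.
sub shorten_ids {
    my ($line) = @_;
    # Without /g, s/// replaces only the leftmost match, so running it
    # twice handles column 1 and then column 2. \S+ cannot cross a tab,
    # so each match stays inside a single column.
    $line =~ s/\S+uniprotkb:(\S+)/$1/ for 1 .. 2;
    return $line;
}

# Sample line: the two columns from the question plus an invented third one.
my $sample = "dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5\t"
           . "dip:DIP-48957N|uniprotkb:P49281\tother\n";

print shorten_ids($sample);   # prints Q96PU5, P49281 and other, tab-separated
```

Note that this approach relies on both of the first two columns containing a uniprotkb: entry. If one of them does not, the second substitution would act on a later column instead.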


地狱即天堂 2024-11-26 18:49:06

Try something like this:

This is a neat Perl idiom: match a string against a regular expression like this

$columns[0]=~/:((\w|\d)*)$/;

(note that the parentheses define two capture groups there) and assign the results of the match (whatever is in the 1st, 2nd, and so on capture groups) to an array, or to a list of scalar variables in parentheses, like this:

($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;

Because the list on the left has a single element, only the first capture group (the text after the last colon) is kept. See, you were on the right track but you were making it harder than it needed to be :)

#!/usr/bin/perl  

use warnings;
use strict;

my $file = 'DIP.txt';

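open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'DIP_changed.txt') or die "Can't open output file: $!\n";


foreach my $line (<INFILE>) {
    chomp $line;    # drop the newline so it cannot end up inside the last column
    print "The input line is $line\n";
    my @columns = split('\t', $line);

    # Keep only the text after the last colon in each of the first two columns.
    ($columns[0]) = $columns[0]=~/:((\w|\d)*)$/;
    ($columns[1]) = $columns[1]=~/:((\w|\d)*)$/;

    printf  "The output line is  %s\n", join ',', @columns;
    # Write tab-separated output; plain print avoids treating the data as a printf format.
    print $outfile join("\t", @columns), "\n";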

    }
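
As a small, self-contained illustration of the list-context match used above, the sketch below shows what is assigned when the pattern matches and when it does not. The variable names and the simplified pattern /:(\w+)$/ are assumptions for the example (\w already includes digits, so the extra \d alternative is not needed).

```perl
use strict;
use warnings;

my $column = 'dip:DIP-48957N|uniprotkb:P49281';

# In list context a successful match returns the captured groups, so the
# single capture can be assigned straight to a scalar in parentheses.
my ($id) = $column =~ /:(\w+)$/;

# If the match fails, the returned list is empty and the variable stays
# undefined, so real code should check for that before using the value.
my ($missing) = 'no-colon-here' =~ /:(\w+)$/;

print "$id\n";                                          # P49281
print defined $missing ? "matched\n" : "no match\n";    # no match
```

This matters for the script above: a column without a colon would be replaced by undef, which triggers a warning under use warnings when the line is joined.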


时间海 2024-11-26 18:49:06

ratsbane's answer was pretty good, but after hours of work you probably want to know why you got the result you did. The reason is that $col1 has a pipe in it, and an unescaped pipe means "OR" (alternation) in a regex. So when you used $col1 as the pattern in your substitution, you were running a find and replace with this regex:

dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5

Now as a regex, what does it match? It matches only

dip:DIP-41935N

so that is what got replaced!

Hope that helps!
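
One way to keep the original substitution approach is to quote the interpolated variable with \Q...\E (the quotemeta escape), so the pipes are matched literally. The following is a minimal sketch using the sample column from the question; the variable names $unquoted and $quoted are made up for the example.

```perl
use strict;
use warnings;

# Sample first column taken from the question.
my $col1 = 'dip:DIP-41935N|refseq:NP_056092|uniprotkb:Q96PU5';
my $id   = 'Q96PU5';

# Unquoted: the pipes act as alternation, so only the first alternative
# ("dip:DIP-41935N") is matched and replaced.
(my $unquoted = $col1) =~ s/$col1/$id/;

# Quoted with \Q...\E: the whole column is matched as a literal string.
(my $quoted = $col1) =~ s/\Q$col1\E/$id/;

print "$unquoted\n";   # Q96PU5|refseq:NP_056092|uniprotkb:Q96PU5
print "$quoted\n";     # Q96PU5
```

The first line of output reproduces the unexpected result from the question; the second is the intended one.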


蛮可爱 2024-11-26 18:49:06

There's probably no really good reason to slurp the file in at the beginning, rather than processing it line by line. Processing line by line will scale better. With that in mind, I would do it this way:

use warnings;
use strict;


my $file = 'DIP.txt';

open my $in_fh, '<', $file or die $!;
open my $out_fh, '>', 'new' . $file or die $!;

while ( <$in_fh> ) {
    chomp;
    next unless length $_; # Skip blank lines.
    my ( @columns ) = split /\s+/, $_; # Split on whitespace (you may prefer \t).
    foreach my $column ( @columns[0, 1] ) { # Only the first two columns.
        ( $column ) = $column =~ m{([^:]+)$};
    }
    local $" = "\t";
    print $out_fh "@columns\n";
}

First, this uses the three-arg version of open on both the input file and the output file. This is a good habit to get into. Next, it uses lexical filehandles instead of the old bareword (typeglob) filehandles. Lexicals auto-close when they pass out of scope, and don't become part of the global symbol table.

Next, the script reads the file and processes it line by line, to avoid slurping. This could be advantageous if the file potentially grows large, or if you're in an environment where memory is at a premium. Unless you have a good reason to slurp, you may as well get in the habit of not doing so.

Then I split on whitespace. You could split on tabs instead; unless there's embedded whitespace in the columns, either way works. Then I iterate over the two columns, capturing from each everything at the end of the column that is not a colon, or put another way, everything that comes after the last colon. I capture the result right back into the $column variable, which aliases the corresponding element in @columns. That way, when I'm done, the processed elements of @columns hold only my captures.
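
The aliasing described here can be sketched in isolation. In the example below the sample data (including the third column) and the defined-check are assumptions added for illustration; looping over the slice @columns[0, 1] is one way to restrict the change to the first two columns.

```perl
use strict;
use warnings;

# Two columns modelled on the question plus an invented third column.
my @columns = (
    'dip:DIP-41935N|uniprotkb:Q96PU5',
    'dip:DIP-48957N|uniprotkb:P49281',
    'free text: untouched',
);

# The loop variable is an alias, so assigning to it rewrites the array
# element. Looping over a slice limits the change to those elements.
for my $column (@columns[0, 1]) {
    my ($id) = $column =~ m{([^:]+)$};
    $column = $id if defined $id;   # keep the original value if nothing matched
}

print join('|', @columns), "\n";
```

The third element keeps its colon because it is not part of the slice.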

Finally, after processing the two columns, we localize $", assigning to it a tab character. That way when we print the two columns by wrapping @columns in quotes, the interpolation automatically sticks a tab character between the columns again. If you prefer a different character, you now know where to change it.
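
For comparison, here is a minimal sketch of the $" list separator next to join, which gives the same string without touching a global variable. The sample values and variable names are assumptions for the example.

```perl
use strict;
use warnings;

my @columns = ('Q96PU5', 'P49281', 'other');

my $with_separator;
{
    # $" is the separator used when an array is interpolated in a string.
    # local restores the default (a single space) when the block ends.
    local $" = "\t";
    $with_separator = "@columns";
}

# join builds the same string explicitly.
my $with_join = join "\t", @columns;

print $with_separator eq $with_join ? "same\n" : "different\n";   # same
```

Which one to use is mostly a matter of taste; join keeps the separator visible at the point where the line is printed.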

Then the while loop moves on to the next line. Any blank lines will be skipped.

See perldoc -f open, perlretut, perlvar, and perlop for an explanation of three-arg open and lexical filehandles, regexps, Perl's special variables such as $", and how quote-like interpolation works.

Good question!
