How to read a file and create a record for each line

Posted 2024-10-06 12:31:30

Looking for help on writing a Perl program that takes an input file and performs manipulations based on follow-up commands. I'm a beginning Perl student, so please don't get too advanced in your suggestions. The structure I have so far is a main program and 4 subs.

I'm having trouble with two parts:

Writing the portion of the main segment that creates a unique record for each line of the input file (which is in fixed-width format). I think this should be done with substr, but I don't know much more about how it should be structured. unpack is beyond the scope of my learning so far.
One of the functions called in the main program is a "distance" sub, which will calculate the distance between atoms. I'm thinking this should be a for loop inside a for loop. Any thoughts on what approach I should take?

The program should store an array of atom records (one record/atom per line), each holding:

• The atom's serial number, 5 digits (cols 7 - 11)

• The three-letter name of the amino acid to which it belongs (cols 18 - 20)

• The atom's three orthogonal coordinates (x, y, z) in Angstroms, as real (decimal) numbers (cols 31 - 54):
X in cols 31-38
Y in cols 39-46
Z in cols 47-54

• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77 - 78)

sub Distance
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. (cols 31-54)

Here is sample text from an input file.

# truncating for testing purposes. Actual data is approx. 100 columns
# and starts with ATOM or HETATM    
__DATA__   
ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C    
ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C    
ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O

Here is what I have so far for the main program and the makeRecord sub. I hate to be lame, but I don't have anything to show for the Distance sub yet, so don't worry about giving code; any suggestions on how to approach it would be very much appreciated.

use warnings;
use strict; 

my @fields;
my @recs;

while ( <DATA> ) {
chomp;
@fields = split(/\s+/);
push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
  freq => \&freq,
  length => \&length,
  density => \&density,
  help => \&help, 
  quit => \&quit
);

print "Enter a command: ";
  while ( <STDIN> ) {
  chomp; 
  my @line = split( /\s+/);
  my $command = shift @line;
  if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
    print "Command must be: freq, length, density or quit\n";
  }
    else {
    $command_table{$command}->();
  }
print "Enter a command: ";
}

sub makeRecord 
# Read the entire line and make records from the lines that contain the 
# word ATOM or HETATM in the first column. Not sure how to do this:
{
 my %record = 
 (
 serialnumber => shift,
 aminoacid => shift,
 coordinates => shift,
 element  => [ @_ ]
 );
 return\%record;
 }
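
The command loop above repeats the command names both in the dispatch table and in a regex. One possible alternative, sketched here under assumptions, is to ask the hash itself whether the command exists. The handler bodies and the name run_command are hypothetical stand-ins, since the real freq, length, density, help and quit subs are not shown.

```perl
use strict;
use warnings;

# Stub handlers: hypothetical placeholders for the real subs.
my %command_table = (
    freq    => sub { return "freq called" },
    length  => sub { return "length called" },
    density => sub { return "density called" },
    help    => sub { return "Commands: freq, length, density, help, quit" },
    quit    => sub { return "bye" },
);

# Look the command up in the table. The hash keys are the only list of
# valid commands, so no separate regex has to be kept in sync with them.
sub run_command {
    my ($input) = @_;
    my ($command, @args) = split ' ', $input;
    return "Command must be one of: " . join(', ', sort keys %command_table)
        unless defined $command && exists $command_table{$command};
    return $command_table{$command}->(@args);
}

print run_command('help'), "\n";
print run_command('lengthy'), "\n";   # not a key in the table, so rejected
```

With this shape, adding a command means adding one entry to %command_table and nothing else.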

刘备忘录 2024-10-13 12:31:30

There's Perl code available online for working with PDB files (which is obviously what you are doing). I'm not suggesting just using a module you downloaded and being done with it, as surely your instructor wouldn't approve, and you wouldn't learn that much ;) But you could take a look at some of the code that's offered and see whether some bits there address your problem.

I did a quick bit of googling and saw that there's ParsePDB.pm (for example). You can find the web page here. I didn't look at the code or the functionality, though; I'm just hoping there will be something in there that you find helpful.

EDIT 1

Okay, it's 14 hours later now, and I felt like doing some coding, so as you have not yet accepted an answer I thought I could just ignore my own advice and draw up something (as you will notice I have copied Zaid's data structure)...

#!/usr/bin/perl

use warnings;
use strict;

sub makeRecord {
   my ($ser_num, $aa, $x, $y, $z, $element) = @_;
   # copying Zaid now as her/his structure looks very sensible!
   my $record = {
                  serial  => $ser_num,
                  aa      => $aa,
                  element => $element,
                  xyz     => [$x, $y, $z],
                };
   return $record;
}


my $file = shift @ARGV;
my @records; # will be an array of hash references

open my $fh, '<', $file or die "Cannot open '$file': $!";
while (<$fh>) {
   if (/^(?:ATOM|HETATM)/) { # only get the structure data lines
      chomp; # not necessary here, but good practice I'd say

      my @fields = split; # by default 'split' splits on whitespace

      # now use an array slice to only pass the array elements
      # you're interested in (using the positional indices from @fields).
      # Note: this relies on every field being separated by whitespace.
      push @records, makeRecord(@fields[1,3,6,7,8,11]);
   }
}
close $fh;
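
The split-based loop above depends on whitespace separating every field. Since the question lists fixed column positions, a substr-based variant is one alternative. The following is a minimal sketch, not a definitive implementation: the helper name make_record_from_line is hypothetical, the hash keys follow the question's makeRecord, and each line is assumed to be at least 78 characters long.

```perl
use strict;
use warnings;

# Hypothetical helper: cut one PDB line into a record by column position.
# substr takes a 0-based offset, so column N starts at offset N - 1.
sub make_record_from_line {
    my ($line) = @_;
    my %record = (
        serialnumber => substr($line,  6, 5),   # cols  7-11
        aminoacid    => substr($line, 17, 3),   # cols 18-20
        coordinates  => [
            substr($line, 30, 8),               # X, cols 31-38
            substr($line, 38, 8),               # Y, cols 39-46
            substr($line, 46, 8),               # Z, cols 47-54
        ],
        element      => substr($line, 76, 2),   # cols 77-78
    );
    # Fixed-width fields are padded with spaces, so strip the padding.
    for my $value (@record{qw(serialnumber aminoacid element)},
                   @{ $record{coordinates} }) {
        $value =~ s/^\s+|\s+$//g;
    }
    return \%record;
}

my $line = 'ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C';
my $rec  = make_record_from_line($line);
print "$rec->{serialnumber} $rec->{aminoacid} @{ $rec->{coordinates} } $rec->{element}\n";
```

Because the fields are taken by position, this keeps working when two neighbouring fields are not separated by a space.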

EDIT 2

Concerning the distance subroutine: a for loop inside a for loop should do the job, but this is the brute-force way, which might take quite a while depending on the size of your input molecule (you'd have to do on the order of (number_of_atoms)^2 distance calculations). For the purpose of your assignment the brute-force approach is probably acceptable; in other cases you'd have to decide whether to favour ease of coding or computational speed. If your instructor also wants you to keep the latter in mind, you could take a look at this page (I know you actually want the largest distance, and you're in 3D, not 2D...)
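
The brute-force idea can be sketched roughly as follows. This is only an illustration under assumptions: the sub names atom_distance and max_distance are hypothetical, and each record is assumed to hold its coordinates as [x, y, z] under the key xyz, as in the makeRecord sub above.

```perl
use strict;
use warnings;

# Euclidean distance between two atom records. Each record is assumed to
# keep its coordinates as [x, y, z] under the key 'xyz'.
sub atom_distance {
    my ($p, $q) = @_;
    my $sum = 0;
    for my $i (0 .. 2) {
        my $diff = $p->{xyz}[$i] - $q->{xyz}[$i];
        $sum += $diff * $diff;
    }
    return sqrt($sum);
}

# Largest distance over all pairs. The inner loop starts at $i + 1, so each
# pair is visited once: n*(n-1)/2 comparisons instead of n*n.
sub max_distance {
    my @atoms = @_;
    my $max = 0;
    for my $i (0 .. $#atoms - 1) {
        for my $j ($i + 1 .. $#atoms) {
            my $d = atom_distance($atoms[$i], $atoms[$j]);
            $max = $d if $d > $max;
        }
    }
    return $max;
}

my @atoms = (
    { serial => 4743, xyz => [19.896, 32.017, 54.717] },
    { serial => 4744, xyz => [19.589, 30.757, 55.525] },
    { serial => 4745, xyz => [18.801, 29.892, 55.098] },
);
printf "Max distance: %.3f Angstroms\n", max_distance(@atoms);
```

Starting the inner loop at $i + 1 halves the work but does not change the quadratic growth, so this stays a brute-force approach.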

Ok, now I just hope that you managed to find some helpful bits and pieces in here :)

沧笙踏歌 2024-10-13 12:31:30

It is strange that unpack is out of scope when I can see use of a dispatch table. It would be silly to overlook using unpack if fixed-format files are being processed. There is nothing 'advanced' going on in the code below:

use strict;
use warnings;
use Data::Dump 'dump';   # Use this if you want 'dump' function to work

my @records;
while ( my $record = <DATA> ) {

    next unless $record =~ /^ATOM|^HETATM/;  # Skip unwanted records

    # unpack minimizes the amount of work the code has to do ...
    # ... especially since you only want a small part of the file
    # 'x' tokens skip characters, 'A' tokens read them ...
    # The number following each token is a repetition count ...
    # ... so in this case the first 6 characters are skipped ...
    # ... and the next 5 are assigned to $serNo.
    # Each coordinate is 8 characters wide (cols 31-38, 39-46, 47-54).

    my ( $serNo, $aminoAcid, $xCoord, $yCoord, $zCoord )
        = unpack 'x6A5x6A3x10A8A8A8', $record;           # Get only what you want

    # Assign data to a hash reference

    my $recordStructure = {
                            serialnumber => $serNo,
                            aminoacid    => $aminoAcid,
                            coordinates  => [ $xCoord, $yCoord, $zCoord ],
                          };

    push @records, $recordStructure;  # Append current record
}

# 'dump' is really useful to view data structures. No need for PrintRec!!

dump @records;
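
As a possible extension, the same idea can also pick up the element name in cols 77-78. The sketch below is an assumption-laden variant, not part of the answer's code: parse_atom_line is a hypothetical name, the template is written with spaces for readability (unpack allows whitespace between codes), and every line is assumed to be at least 78 characters long, since an 'x' skip past the end of the string makes unpack die.

```perl
use strict;
use warnings;

# Template, one code per field:
#   x6        skip cols  1-6  (record name)
#   A5        read cols  7-11 (serial number)
#   x6        skip cols 12-17
#   A3        read cols 18-20 (amino acid)
#   x10       skip cols 21-30
#   A8 A8 A8  read cols 31-54 (x, y, z)
#   x22       skip cols 55-76
#   A2        read cols 77-78 (element)
my $template = 'x6 A5 x6 A3 x10 A8 A8 A8 x22 A2';

sub parse_atom_line {
    my ($line) = @_;
    my ($serial, $amino_acid, $x, $y, $z, $element) = unpack $template, $line;
    # 'A' strips trailing spaces only, so remove the leading padding too.
    s/^\s+// for $serial, $x, $y, $z, $element;
    return {
        serialnumber => $serial,
        aminoacid    => $amino_acid,
        coordinates  => [ $x, $y, $z ],
        element      => $element,
    };
}

my $rec = parse_atom_line(
    'ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C'
);
print "$rec->{serialnumber} $rec->{aminoacid} @{ $rec->{coordinates} } $rec->{element}\n";
```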

眼泪都笑了 2024-10-13 12:31:30

Your records have a fixed-width format, so use unpack to break each record into the fields of interest. Use the stated column positions of each field to construct a template for use with unpack.

my @field_specs = (
    {begin =>  7, end => 11, name => 'serialnumber'},
    {begin => 18, end => 20, name => 'aminoacid'},
    {begin => 31, end => 38, name => 'X'}, 
    {begin => 39, end => 46, name => 'Y'},
    {begin => 47, end => 54, name => 'Z'}, 
    {begin => 77, end => 78, name => 'element'},
);
my $unpack_template = '';
my @col_names;
for my $spec (@field_specs) {
    my $offset = $spec->{begin} - 1;
    my $width  = $spec->{end} - $offset;
    $unpack_template .= "\@${offset}A$width";
    push @col_names, $spec->{name};
}
print "Ready to read @col_names\n using template $unpack_template ...\n";

# prints 
# Ready to read serialnumber aminoacid X Y Z element 
#  using template @6A5@17A3@30A8@38A8@46A8@76A2 ...

my @recs;
while ( <DATA> ) {
    next unless /^(?:ATOM|HETATM)/;    # skip any other record types
    my %record;
    @record{@col_names} = unpack($unpack_template, $_);
    push @recs, \%record;
}
