如何解析文件、创建记录并对记录执行操作,包括术语频率和距离计算
我是 Perl 入门课程的一名学生,正在寻找有关我编写一个分析原子数据的小型(但棘手)程序的方法的建议和反馈。我的教授鼓励论坛。我对 Perl 子程序或模块(包括 Bioperl)并不熟悉,因此请将回复限制在适当的“初学者水平”,以便我可以理解并从您的建议和/或代码中学习(也请限制“Magic”)。
该计划的要求如下:
从命令行读取文件(包含有关 Atom 的数据)创建原子记录数组(每个换行一个记录/原子)。对于每条记录,程序需要存储:
<块引用>• 原子的序列号(第 7 - 11 列)
• 所属氨基酸的三字母名称(第 18 - 20 栏)
• 原子的三个坐标 (x,y,z)(第 31 - 54 列)
• 原子的一个或两个字母元素名称(例如C、O、N、Na)(第77-78 栏)提示输入三个命令之一:频率、长度、密度 d(d 是某个数字):
<块引用>• freq - 文件中每种类型原子的数量(例如氮、钠等将显示如下:N: 918 S: 23
• length - 坐标之间的距离
• 密度d(其中d 是一个数字)- 程序将提示输入要保存计算的文件名,并将包含该原子与每个其他原子之间的距离。如果该距离小于或等于数字 d,则会增加该距离内的原子数计数,除非该计数在文件中为零。输出将类似于:
1:5
2:3
3:6
...(非常大的文件),完成后将关闭。
我正在寻找有关我在下面的代码中编写(和需要编写)的内容的反馈。我特别感谢任何有关如何编写我的字幕的反馈。我在底部提供了示例输入数据。
我看到的程序结构和功能描述:
$^W = 1; # turn on warnings
use strict; # behave!
my @fields;
my @recs;
while ( <DATA> ) {
chomp;
@fields = split(/\s+/);
push @recs, makeRecord(@fields);
}
for (my $i = 0; $i < @recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
freq => \&freq,
length => \&length,
density => \&density,
help => \&help,
quit => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
chomp;
my @line = split( /\s+/);
my $command = shift @line;
if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
print "Command must be: freq, length, density or quit\n";
}
else {
$command_table{$command}->();
}
print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
my %record =
(
serialnumber => shift,
aminoacid => shift,
coordinates => shift,
element => [ @_ ]
);
return\%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
print "To use this program, type either\n",
"freq\n",
"length\n",
"density followed by a number, d,\n",
"help\n",
"quit\n";
}
sub quit
{
exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
I'm a student in an intro Perl class, looking for suggestions and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages forums. I am not advanced with Perl subs or modules (including Bioperl) so please limit responses to an appropriate 'beginner level' so that I can understand and learn from your suggestions and/or code (also limit "Magic" please).
The requirements of the program are as follows:
Read a file (containing data about Atoms) from the command line & create an array of atom records (one record/atom per newline). For each record the program will need to store:
• The atom's serial number (cols 7 - 11)
• The three-letter name of the amino acid to which it belongs (cols 18 - 20)
• The atom's three coordinates (x,y,z) (cols 31 - 54 )
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77-78 )Prompt for one of three commands: freq, length, density d (d is some number):
• freq - how many of each type of atom is in the file (example Nitrogen, Sodium, etc would be displayed like this: N: 918 S: 23
• length - The distances among coordinates
• density d (where d is a number) - program will prompt for the name of a file to save computations to and will containing the distance between that atom and every other atom. If that distance is less than or equal to the number d, it increments the count of the number of atoms that are within that distance, unless that count is zero into the file. The output will look something like:
1: 5
2: 3
3: 6
... (very big file) and will close when it finishes.
I'm looking for feedback on what I have written (and need to write) in the code below. I especially appreciate any feedback in how to approach writing my subs. I've included sample input data at the bottom.
The program structure and function descriptions as I see it:
$^W = 1; # turn on warnings
use strict; # behave!
my @fields;
my @recs;
while ( <DATA> ) {
chomp;
@fields = split(/\s+/);
push @recs, makeRecord(@fields);
}
for (my $i = 0; $i < @recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
freq => \&freq,
length => \&length,
density => \&density,
help => \&help,
quit => \&quit
);
print "Enter a command: ";
while ( <STDIN> ) {
chomp;
my @line = split( /\s+/);
my $command = shift @line;
if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
print "Command must be: freq, length, density or quit\n";
}
else {
$command_table{$command}->();
}
print "Enter a command: ";
}
sub makeRecord
# Read the entire line and make records from the lines that contain the
# word ATOM or HETATM in the first column. Not sure how to do this:
{
my %record =
(
serialnumber => shift,
aminoacid => shift,
coordinates => shift,
element => [ @_ ]
);
return\%record;
}
sub freq
# take an array of atom records, return a hash whose keys are
# distinct atom names and whose values are the frequences of
# these atoms in the array.
sub length
# take an array of atom records and return the max distance
# between all pairs of atoms in that array. My instructor
# advised this would be constructed as a for loop inside a for loop.
sub density
# take an array of atom records and a number d and will return a
# hash whose keys are atom serial numbers and whose values are
# the number of atoms within that distance from the atom with that
# serial number.
sub help
{
print "To use this program, type either\n",
"freq\n",
"length\n",
"density followed by a number, d,\n",
"help\n",
"quit\n";
}
sub quit
{
exit 0;
}
# truncating for testing purposes. Actual data is aprox. 100 columns
# and starts with ATOM or HETATM.
__DATA__
ATOM 4743 CG GLN A 704 19.896 32.017 54.717 1.00 66.44 C
ATOM 4744 CD GLN A 704 19.589 30.757 55.525 1.00 73.28 C
ATOM 4745 OE1 GLN A 704 18.801 29.892 55.098 1.00 75.91 O
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
看来您的 Perl 技能在使用引用和复杂的数据结构方面取得了很大进步。以下是一些提示和一般建议。
使用
use warnings
启用警告,而不是$^W = 1
。前者是自记录的,其优点是位于封闭块的本地而不是全局设置。使用命名良好的变量,这将有助于记录程序的行为,而不是依赖 Perl 的特殊
$_
。例如:在用户输入场景中,无限循环提供了一种避免重复指令(例如“输入命令”)的方法。请参见下文。
您的正则表达式可以简化以避免重复锚点的需要。请参阅下文。
作为一般规则,肯定测试比否定测试更容易理解。请参阅下面修改后的
if-else
结构。将程序的每个部分包含在其自己的子例程中。由于多种原因,这是一个很好的一般实践,所以我就开始这个习惯。
一个相关的良好实践是尽量减少全局变量的使用。作为练习,您可以尝试编写程序,使其完全不使用全局变量。相反,任何需要的信息都将在子例程之间传递。对于小型程序,不一定需要严格避免全局变量,但牢记理想并不是一个坏主意。
为您的
length
子例程指定一个不同的名称。该名称已被内置length
函数使用。关于您关于
makeRecord
的问题,一种方法是忽略makeRecord
内部的过滤问题。相反,makeRecord 可以包含一个额外的哈希字段,并且过滤逻辑将驻留在其他地方。例如:上述一些要点的说明:
It looks like your Perl skills are advancing nicely -- using references and complex data structures. Here are a few tips and pieces of general advice.
Enable warnings with
use warnings
rather than$^W = 1
. The former is self-documenting and has the advantage being local to the enclosing block rather than being a global setting.Use well-named variables, which will help document the program's behavior, rather than relying on Perl's special
$_
. For example:In user-input scenarios, an endless loop provides a way to avoid repeated instructions like "Enter a command". See below.
Your regex can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified
if-else
structure below.Enclose each part of program within its own subroutine. This is a good general practice for a bunch of reasons, so I would just start the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try to write the program so that it uses no global variables at all. Instead, any needed information would be passed around between the subroutines. With small programs one does not necessarily need to be rigid about the avoidance of globals, but it's not a bad idea to keep the ideal in mind.
Give your
length
subroutine a different name. That name is already used by the built-inlength
function.Regarding your question about
makeRecord
, one approach is to ignore the filtering issue insidemakeRecord
. Instead,makeRecord
could include an additional hash field, and the filtering logic would reside elsewhere. For example:An illustration of some of the points above:
FM的回答非常好。我只想提一些额外的事情:
您已经有了包含有效命令的散列(这是一个好主意)。无需在正则表达式中复制该列表。我会做这样的事情:
请注意,我还将
@line
传递给子例程,因为密度命令需要它。不带参数的子例程可以忽略它们。您还可以使用
keys %command_table
生成错误消息的有效命令列表,但我将其作为练习留给您。另一件事是输入文件的描述提到了列号,这表明它是固定宽度的格式。最好使用
substr
或unpack
进行解析。如果字段为空或包含空格,则您的拆分将无法正确解析它。 (如果您使用substr
,请注意,它对从 0 开始的列进行编号,而人们通常将第一列标记为 1。)FM's answer is pretty good. I'll just mention a couple of additional things:
You already have a hash with the valid commands (which is a good idea). There's no need to duplicate that list in a regex. I'd do something like this:
Notice I'm also passing
@line
to the subroutine, because you'll need that for the density command. Subroutines that don't take arguments can just ignore them.You could also generate the list of valid commands for the error message by using
keys %command_table
, but I'll leave that as an exercise for you.Another thing is that the description of the input file mentions column numbers, which suggests that it's a fixed-width format. That's better parsed with
substr
orunpack
. If a field is ever blank or contains a space, then your split will not parse it correctly. (If you usesubstr
, be aware that it numbers columns starting at 0, when people often label the first column 1.)