How can I efficiently populate an N x M grid with Perl?

Posted on 2024-09-28 11:49:34

I have a Perl script that parses a data file and writes 5 output files, each filled with a 1100 x 1300 grid. The script works, but in my opinion it's clumsy and probably inefficient. It is also inherited code, which I have modified a little to make it more readable. Still, it's a mess.

At the moment, the script reads the data file (~4 MB) into an array. It then loops through the array, parsing its contents and pushing values to another array, and finally prints them to a file in another for loop. If no value is found for a certain point, it prints 9999. Zero is an acceptable value.

The data file has 5 different parameters, and each of them is written to its own file.

Example of data:

data for the param: 2
5559
// (x,y) count values
280 40 3  0 0 0 
280 41 4  0 0 0 0 
280 42 5  0 0 0 0 0 
281 43 4  0 0 10 10 
281 44 4  0 0 10 10 
281 45 4  0 0 0 10 
281 46 4  0 0 10 0 
281 47 4  0 0 10 0 
281 48 3  10 10 0 
281 49 2  0 0 
41 50 3  0 0 0 
45 50 3  0 0 0 
280 50 2  0 0 
40 51 8  0 0 0 0 0 0 0 0
...

data for the param: 3
3356
// (x,y) count values

5559 is the number of data lines for the current parameter. Each data line consists of x, y, the number of consecutive x-values starting at that point, and finally the values themselves.
There is an empty line between parameters.
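
The record format described above can be sketched as a small parser. This is only an illustration, assuming whitespace-separated fields; `expand_line` is a hypothetical helper name and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one data line "x y count v1 v2 ..." into
# a list of [x, y, value] triples, one per consecutive x position.
sub expand_line {
    my ($line) = @_;
    my ($x, $y, $n, @vals) = split ' ', $line;
    return map { [ $x + $_, $y, $vals[$_] ] } 0 .. $n - 1;
}

# Expand one line from the sample data and show the resulting cells.
my @cells = expand_line("281 43 4  0 0 10 10 ");
printf "(%d,%d) = %d\n", @$_ for @cells;
```

For the sample line this yields four cells, x = 281 through 284 at y = 43, carrying the values 0, 0, 10 and 10.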

As I said earlier, the script works, but I feel this could be done much more easily and efficiently. I just don't know how. So here's a chance for self-improvement.

What would be a better approach to this problem than a complicated combination of arrays and for loops?

EDIT:

I should have been clearer about this, sorry.

The output is a 1100 x 1300 grid filled with values read from the data file. Each parameter is written to a different file. More than one value on a data line means that the line has data for the points (x+n, y).

UPDATE:

I tested the solution and, to my surprise, it was slower than the original script (by ~3 seconds). However, the script is ~50% smaller, which makes it much easier to understand what the script actually does. In this case that's more important than a 3-second speed gain.

Here is some of the code from the older script. I hope you'll get the basic idea from it. Why is it faster?

for my $i (0..$#indata) {  # Data file is read into @indata
    ...
    if ($indata[$i] =~ /^data for the param:/) {
        push @block, $i;  # block borders, i.e. lines where a block starts and ends
    }
    ...
}
# Then handle the data blocks
for my $k (0..4) {  # 5 parameters
    ...
    if ($k eq '4') {  # Last parameter
        $enddata = $#indata;
    }
    else {
        $enddata = $block[$k+1];
    }
    ...
    for my $p ($block[$k]..$enddata) {  # from current block to next block
        ...
        # Fill data array
        for (my $m=0; $m<$n; $m++) {
            $data[$x][$y] = $values[$m];
        }

    }
    print2file();

}

凶凌 2024-10-05 11:49:34

The following stores the sparse grid in a hash of hashes. When printing, it prints 9999 for cells with undefined values. I changed the code to build each row as a string to reduce the memory footprint.

#!/usr/bin/perl

use strict;
use warnings;

use constant GRID_X => 1100 - 1;
use constant GRID_Y => 1300 - 1;

while (my $data = <DATA> ) {
    if ( $data =~ /^data for the param: (\d)/ ) {
        process_param($1, \*DATA);
    }
}

sub process_param {
    my ($param, $fh) = @_;
    my $lines_to_read = <$fh>;
    my $lines_read = 0;

    $lines_to_read += 0;

    my %data;

    while ( my $data = <$fh> ) {
        next if $data =~ m{^//};
        last unless $data =~ /\S/;
        $lines_read += 1;

        my ($x, $y, $n, @vals) = split ' ', $data;

        for my $i ( 0 .. ($n - 1) ) {
            $data{$x + $i}{$y} = 0 + $vals[$i];
        }
    }
    if ( $lines_read != $lines_to_read ) {
        warn "read $lines_read lines, expected $lines_to_read\n";
    }

    # this is where you would open a $param specific output file
    # and write out the full matrix, instead of printing to STDOUT
    # as I have done. As an improvement, you should probably factor
    # this out to another sub.

    for my $x (0 .. GRID_X) {
        my $row;
        for my $y (0 .. GRID_Y) {
            my $v = 9999;
            if ( exists($data{$x})
                    and exists($data{$x}{$y})
                    and defined($data{$x}{$y}) ) {
                $v = $data{$x}{$y};
            }
            $row .= "$v\t";
        }
        $row =~ s/\t\z/\n/;
        print $row;
    }

    return;
}


__DATA__
data for the param: 2
5559
// (x,y) count values
280 40 3  0 0 0 
280 41 4  0 0 0 0 
280 42 5  0 0 0 0 0 
281 43 4  0 0 10 10 
281 44 4  0 0 10 10 
281 45 4  0 0 0 10 
281 46 4  0 0 10 0 
281 47 4  0 0 10 0 
281 48 3  10 10 0 
281 49 2  0 0 
41 50 3  0 0 0 
45 50 3  0 0 0 
280 50 2  0 0 
40 51 8  0 0 0 0 0 0 0 0
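
Since the asker reports that the older, array-based script ran faster, a dense array pre-filled with 9999 is one alternative worth trying. It drops the per-cell exists/defined checks, which may or may not account for the difference; I have not measured it. The sketch below assumes the grid is indexed [x][y] as in the code above, and `new_grid`, `store_line` and `write_grid` are hypothetical names.

```perl
use strict;
use warnings;

use constant MISSING => 9999;

# Build an $nx by $ny array of arrays with every cell set to MISSING.
sub new_grid {
    my ($nx, $ny) = @_;
    return [ map { [ (MISSING) x $ny ] } 1 .. $nx ];
}

# Store the values of one data line at (x, y), (x+1, y), ...
sub store_line {
    my ($grid, $line) = @_;
    my ($x, $y, $n, @vals) = split ' ', $line;
    $grid->[ $x + $_ ][$y] = 0 + $vals[$_] for 0 .. $n - 1;
    return;
}

# Write the grid as tab-separated rows, one row per x index.
sub write_grid {
    my ($grid, $fh) = @_;
    print {$fh} join("\t", @$_), "\n" for @$grid;
    return;
}

my $grid = new_grid(1100, 1300);
store_line($grid, "281 43 4  0 0 10 10 ");
# One stored value and one cell that was never set.
print "$grid->[283][43] $grid->[285][43]\n";
```

In a full script, `write_grid` would be called once per parameter with a filehandle opened for that parameter's output file. Note that a data line reaching past the grid edge would silently extend the arrays, so a bounds check may be worth adding.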

白馒头 2024-10-05 11:49:34

Perl supports multidimensional arrays if you use references.

my $matrix = [];
$matrix->[0]->[0] = $valueAt0x0;

So you could read the entire thing in one go

my $matrix = [];
while (my $ln = <INPUT>) {
  chomp $ln;
  # a fresh lexical @row per line, so each pushed reference is distinct
  my @row = split ' ', $ln; # assuming input is separated by whitespace
  push @$matrix, \@row;
}
# the matrix has now been read; let's print it
foreach my $row (@$matrix) {
  print join(",", @{$row}), "\n";
}
# the matrix is now printed with "," as a separator
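
Because the question says zero is a valid value and 9999 marks a missing one, the default has to be applied with a defined check rather than a truth check. Here is a minimal sketch of that, assuming Perl 5.10 or later for the `//` operator; `cell_or_default` is a hypothetical helper name.

```perl
use strict;
use warnings;

my $matrix = [];
$matrix->[2][1] = 0;     # zero is a legitimate value
$matrix->[2][3] = 10;

# Return the cell value, or 9999 when nothing was stored there.
sub cell_or_default {
    my ($m, $x, $y) = @_;
    my $row = $m->[$x] // [];    # avoids autovivifying missing rows
    return $row->[$y] // 9999;   # // keeps 0, unlike ||
}

print join(",", map { cell_or_default($matrix, 2, $_) } 0 .. 3), "\n";
```

With `||` instead of `//`, the stored 0 at [2][1] would be replaced by 9999 as well.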

Hope this helps.

一身骄傲 2024-10-05 11:49:34

Since you don't describe your desired output, it's impossible to know what to write to the files. But this does the reading part in a pretty flexible way. You could probably micro-optimize the number of regular expressions or drop the implicit topic variable $_ for improved legibility. If you are willing to commit to a certain output format for each cell of the matrix before calling flush_output (such as "all values joined by commas"), then you can do away with the innermost layer of arrays and just do $matrix[$x][$y] .= ($matrix[$x][$y] ? ',' : '') . join(',', @data); or something similar and less obscure.

use strict;
use warnings;

my $cur_param;
my @matrix;
while (<DATA>) {
  chomp;
  s/\/\/.*$//;
  next if /^\s*$/;

  if (/^data for the param: (\d+)/) {
    flush_output($cur_param, \@matrix) if defined $cur_param;
    $cur_param = $1;
    @matrix = (); # reset
    # skip the line with number of rows, we're smarter than that
    my $tmp = <DATA>;
    next;
  }

  (my $x, my $y, undef, my @data) = split /\s+/, $_;
  $matrix[$x][$y] ||= [];
  push @{$matrix[$x][$y]}, @data;
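}
# flush the final parameter block; no later header line triggers it
flush_output($cur_param, \@matrix) if defined $cur_param;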

sub flush_output {
  my $cur_param = shift;
  my $matrix = shift;
  # in reality: open file and dump
  # ... while dumping, do an ||= [9999] for the default...

  # here: simple debug output:
  use Data::Dumper;
  print "\nPARAM $cur_param\n";
  print Dumper $matrix;
}

__DATA__
data for the param: 2
5559
// (x,y) count values
280 40 3  0 0 0 
280 41 4  0 0 0 0 
280 42 5  0 0 0 0 0 
281 43 4  0 0 10 10 
281 44 4  0 0 10 10 
281 45 4  0 0 0 10 
281 46 4  0 0 10 0 
281 47 4  0 0 10 0 
281 48 3  10 10 0 
281 49 2  0 0 
41 50 3  0 0 0 
45 50 3  0 0 0 
280 50 2  0 0 
40 51 8  0 0 0 0 0 0 0 0

data for the param: 3
3356
// (x,y) count values
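
As a rough idea of what the "open file and dump" step could look like, here is a sketch that writes each cell's values joined by commas and falls back to 9999 for empty cells. The name `dump_matrix`, its argument order and the tab-separated layout are assumptions made for illustration, and `//` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical dump routine: writes $nx rows of $ny tab-separated cells
# to the filehandle supplied by the caller.
sub dump_matrix {
    my ($fh, $matrix, $nx, $ny) = @_;
    for my $x (0 .. $nx - 1) {
        my $col = $matrix->[$x] // [];
        # Each cell is an array ref of values; empty cells default to 9999.
        my @cells = map { join(',', @{ $col->[$_] // [9999] }) } 0 .. $ny - 1;
        print {$fh} join("\t", @cells), "\n";
    }
    return;
}

my @matrix;
push @{ $matrix[1][0] }, 0, 0;    # two values recorded at (1, 0)
dump_matrix(\*STDOUT, \@matrix, 2, 2);
```

In the real script the filehandle would come from an open on a per-parameter output file, and the dimensions would be 1100 and 1300 as stated in the question.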
