How do I extract columns from a fixed-width format in Perl?

Posted on 2024-08-06 02:23:31

I'm writing a Perl script to run through and grab various data elements such as:

1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000 
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

I can grab each line of this text file no problem.

I have a working regex for each of those fields. Once I have the line in a variable, i.e. $line, how can I grab each field and place it into its own variable, even though the fields are separated by different amounts of whitespace and some are missing?

油饼 2024-08-13 02:23:31

This example shows how to parse the line either with whitespace as the delimiter (split) or with a fixed-column layout (unpack). With unpack, an upper-case template letter (A10 etc.) strips trailing whitespace from each field for you; leading whitespace in right-aligned columns is kept and has to be trimmed separately. Note: as brian d foy points out, the split approach does not work well when fields are missing (for example, in the second line of data), because the information about which column a value came from is lost. unpack is the way to go here, unless we are misunderstanding your data.

use strict;
use warnings;

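while (my $line = <DATA>) {
    chomp $line;
    # Whitespace-delimited: a blank column shifts the later values left.
    my @fields_whitespace = split /\s+/, $line;
    # Fixed columns of width 10, 10, 12 and 28: a blank column keeps its slot.
    my @fields_fixed = unpack('a10 a10 a12 a28', $line);
}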

__DATA__
1253592000                                                  
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000
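
Since the question asks for each field in its own variable, here is a minimal sketch of the unpack approach that does so. It assumes the 10/10/12/28 column widths from the template above; the helper name parse_line and the variable names (borrowed from the column names used in other answers on this page) are illustrative only. The sample line is built with sprintf so that the spacing is exact.

```perl
use strict;
use warnings;

# Column widths assumed from the unpack template in the answer above.
my $TEMPLATE = 'A10 A10 A12 A28';

# Hypothetical helper: returns the four fields of one line.
# A blank column comes back as undef instead of shifting its neighbours.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # Pad short lines so every column is present for unpack.
    $line = sprintf '%-60s', $line;
    my @fields = unpack $TEMPLATE, $line;
    for (@fields) {
        s/^\s+//;                   # 'A' strips trailing whitespace only
        $_ = undef unless length;   # blank column -> undef
    }
    return @fields;
}

# A line with the second column missing, laid out in the assumed widths.
my $line = sprintf '%10s%10s%12s%28s', 1253678400, '', 86400, '6183.000000';

my ($timestamp, $field2, $period, $field4) = parse_line($line);
print join('|', map { defined $_ ? $_ : '' }
                $timestamp, $field2, $period, $field4), "\n";
# prints: 1253678400||86400|6183.000000
```

Returning undef for a blank column lets the caller distinguish "missing" from a real value such as 0.000000.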

孤独陪着我 2024-08-13 02:23:31

Use my module DataExtract::FixedWidth. It is the most full-featured and well-tested module for working with fixed-width columns in Perl. If it isn't fast enough, you can pass in an unpack_string and eliminate the need for heuristic detection of the column boundaries.

#!/usr/bin/env perl
use strict;
use warnings;
use DataExtract::FixedWidth;
use feature ':5.10';

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
});

say join ('|',  @{$de->parse($_)}) for @rows;

Alternatively, if you want header info:

my @rows = <DATA>;
my $de = DataExtract::FixedWidth->new({
  heuristic => \@rows
  , header_row => undef
  , cols => [qw/timestamp field2 period field4/]
});

use Data::Dumper;
warn Dumper $de->parse_hash($_) for @rows;

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200  36.000000      86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

佞臣 2024-08-13 02:23:31

I'm unsure of the column names and formatting, but you should be able to adjust this recipe to your liking using Text::FixedWidth:

use strict;
use warnings;
use Text::FixedWidth;

my $fw = Text::FixedWidth->new;
$fw->set_attributes(
    qw(
        timestamp undef  %10s
        field2    undef  %10s
        period    undef  %12s
        field4    undef  %28s
        )
);

while (<DATA>) {
    $fw->parse( string => $_ );
    print $fw->get_timestamp . "\n";
}

__DATA__
1253592000
1253678400                 86400                 6183.000000
1253764800                 86400                 4486.000000
1253851200 36.000000       86400                10669.000000
1253937600  0.000000       86400                 9126.000000
1254024000  0.000000       86400                 2930.000000
1254110400  0.000000       86400                 2895.000000
1254196800  0.000000                             8828.000000

攒眉千度 2024-08-13 02:23:31

You can split the line. It appears that your delimiter is just whitespace? You can do something on the order of:

@line = split(" ", $line);

This splits on any run of whitespace and skips leading whitespace. You can then do bounds checking and access each field via $line[0], $line[1], etc.

split can also take a regular expression rather than a string as the delimiter:

@line = split(/\s+/, $line);

This does almost the same thing, except that a line starting with whitespace yields an empty first field.
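
A small sketch of how the two split forms behave on the question's data may help here. The two sample lines are copied from the question; the variable names are illustrative. It shows that when a column is blank, split returns fewer fields and the later values shift left, and that the two delimiter forms differ only in how they treat leading whitespace.

```perl
use strict;
use warnings;

my $full    = '1253851200  36.000000      86400                10669.000000';
my $missing = '1253678400                 86400                 6183.000000';

my @all_four = split ' ', $full;      # 4 fields, in column order
my @shifted  = split ' ', $missing;   # 3 fields: 86400 has moved into slot 1

print scalar(@all_four), " fields, second is $all_four[1]\n";
# prints: 4 fields, second is 36.000000
print scalar(@shifted), " fields, second is $shifted[1]\n";
# prints: 3 fields, second is 86400

# Leading whitespace: ' ' skips it, /\s+/ yields an empty first field.
my @skip_leading = split ' ',   '  x y';
my @keep_leading = split /\s+/, '  x y';
print scalar(@skip_leading), " vs ", scalar(@keep_leading), "\n";
# prints: 2 vs 3
```

So bounds checking on the result tells you that a field is missing, but not which one.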

怪我入戏太深 2024-08-13 02:23:31

If all fields have the same fixed width and are formatted with spaces, you can use the following split:

@array = split / {1,N}/, $line;

where N is the width of the field. This will yield a space for each empty field.

栀梦 2024-08-13 02:23:31

Fixed width delimiting can be done like this:

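my %header = (
    field1 => 0,     # char position of first char in field
    field2 => 12,
    field3 => 15,
);

while (my $line = <IN>) {
    chomp $line;
    next if length($line) <= $header{field2};   # line too short to contain field2
    # substr takes an offset and a length, so the width of field2
    # is the distance to the start of field3.
    print substr($line, $header{field2}, $header{field3} - $header{field2}), "\n";   # value of field2
}

My Perl is very rusty, so treat this as a sketch rather than tested code, but that is the gist of it.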
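
The table of start positions can also be turned into an unpack template, which ties this idea to the unpack answer above. This is only a sketch: the helper name template_from_offsets is hypothetical, and the start columns 0, 10, 20 and 32 are assumed from the 10/10/12/28 layout used elsewhere on this page, not from the offsets in this answer.

```perl
use strict;
use warnings;

# Start column (0-based) of each field; assumed from the 10/10/12/28 layout.
my @starts = (0, 10, 20, 32);

# Hypothetical helper: turn a list of start offsets into an unpack template.
sub template_from_offsets {
    my @s = @_;
    my @parts;
    for my $i (0 .. $#s) {
        # Width is the distance to the next start; the last field takes the rest.
        push @parts, $i < $#s ? 'A' . ($s[$i + 1] - $s[$i]) : 'A*';
    }
    return join ' ', @parts;
}

my $template = template_from_offsets(@starts);
print "$template\n";
# prints: A10 A10 A12 A*
```

The resulting string can be passed straight to unpack, so the offsets only need to be maintained in one place.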
