Perl regular expression syntax

Posted on 2024-09-11 16:51:15

I would like to use Perl to take a previously generated SPSS syntax file and format it for use in an R environment.

This is probably a very simple task for those familiar with Perl and regex, but I am stumbling.

The steps as I've laid them out for this Perl script are as follows:

1. Read in the SPSS file.
2. Find the appropriate chunks of the SPSS file (regex) for further processing and formatting.
3. Do the further processing noted above (more regex).
4. Return R syntax to the command line or, preferably, a file.

The basic format of SPSS value labels syntax is:

...A bunch of nonsense I do not care about...
...
 Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"

execute . 
...Resume nonsense...
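
Only the lines between "Value Labels" and "execute ." matter here. A minimal sketch of that filtering step, assuming the header and terminator always look like they do in the sample above; `value_label_block` and `@sample` are illustrative names, not part of the original script:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the non-blank lines found after the "Value Labels" header and
# before the closing "execute" line. Both patterns are assumptions taken
# from the sample in this thread.
sub value_label_block {
    my @lines = @_;
    my ( $inside, @block ) = (0);
    for my $line (@lines) {
        if ( !$inside ) {
            # Skip everything up to and including the header line.
            $inside = 1 if $line =~ /^\s*value\s+labels\b/i;
            next;
        }
        last if $line =~ /^\s*execute\b/i;    # end of the block
        push @block, $line if $line =~ /\S/;  # drop blank lines
    }
    return @block;
}

# Sample input mirroring the format shown above.
my @sample = map { $_ . "\n" } (
    '...A bunch of nonsense I do not care about...',
    ' Value Labels',
    '/gender', '1 "M"', '2 "F"',
    '/purpose', '1 "business"', '2 "vacation"', '3 "tiddlywinks"',
    '',
    'execute . ',
    '...Resume nonsense...',
);

print value_label_block(@sample);
```

An explicit flag is used rather than Perl's range (flip-flop) operator because the flag resets on every call, which keeps the helper safe to reuse on several files.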

And the desired R syntax I am after looks like:

gender <- as.factor(gender
    , levels= c(1,2)
    , labels= c("M","F")
    )
...

Here is the Perl script I have written thus far. I have successfully read each line into the appropriate array. I have the general flow of what I need for the final print function, but I need to figure out how to ONLY print the appropriate @levels and @labels arrays for each @vars array.

#!/usr/bin/perl

#Need to change to read from argument in command line
open(VARVAL, "append.txt");
@lines = <VARVAL>;
close(VARVAL);

#Read through each line and put into a variable, a value, or a reject
#I really only want to read in everything between "value labels" and "execute ."
#That probably requires more regex...
foreach  (@lines){
    if ($_ =~ /\//){        #Anything with a / is a variable, remove the / and push
        $_ =~ tr/\///d;
        push(@vars, $_)
    } elsif ($_ =~/\d/) {
        push(@vals, $_)    #Anything that has a number in the line is a value
        }
}
#Splitting each @vals array into levels or labels arrays
foreach (@vals){
    @values = split(/\s+/, $_); #Splitting on a space, vulnerable...better to split on first non digit character?
    foreach (@values) {
        if ($_ =~/\d/){
            push(@levels, $_);
        } else {
            push(@labels, $_)
        }
    }
}

#Get rid of newline
#I should probably do this somewhere else?
chomp(@vars);
chomp(@levels);
chomp(@labels);

#Need to tell it when to stop adding in @levels & @labels. While loop? Hash lookup?
#Need to get rid of final comma
#Need to redirect output to a file
foreach (@vars){
    print $_ ." <- as.factor(" . $_ . "\n\t, levels = c(" ;
         foreach (@levels){
            print $_ . ",";
         }
    print ")\n\t, labels = c(";
    foreach(@labels){
            print $_ . ",";
        }
    print ")\n\t)\n";
}

And finally, here is sample output from the script as it currently runs:

gender <- as.factor(gender
    , levels = c(1,2,1,2,3,)
    , labels = c("M","F","biz","action","tiddlywinks",)
    )

I need this to only include levels 1,2 and labels M and F.

Thanks for the help!

赏烟花じ飞满天 2024-09-18 16:51:15

This seems to work for me:

#!/usr/bin/env perl
use strict;
use warnings;

my $current_label = '';
my @ordered_labels;    # variable names, in order of appearance
my %data;              # $data{variable}{level} = label

while ( my $line = <DATA> ) {
    # Stop at "execute ." once at least one variable has been seen
    last if length $current_label and $line =~ /^\s*execute\b/i;

    if ( $line =~ m{^/(\S+)} ) {    # starts with slash: a new variable
        $current_label = $1;
        push @ordered_labels, $current_label;
        next;
    }
    # A value line: one or more digits, then a quoted label
    if ( length $current_label
        and $line =~ /^\s*(\d+)\s+"(.*)"\s*$/ ) {
        $data{$current_label}{$1} = $2;
    }
}

for my $label ( @ordered_labels ) {
    # Sort numerically so that level 10 comes after level 2
    my @levels = sort { $a <=> $b } keys %{ $data{$label} };
    print "$label <- as.factor($label\n";
    print "    , levels= c(", join( ',', @levels ), ")\n";
    print "    , labels= c(",
        join( ',', map { '"' . $data{$label}{$_} . '"' } @levels ), ")\n";
    print "    )\n";
}

__DATA__
...A bunch of nonsense I do not care about...
...
 Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"

execute .

And yields:

gender <- as.factor(gender
    , levels= c(1,2)
    , labels= c("M","F")
    )
purpose <- as.factor(purpose
    , levels= c(1,2,3)
    , labels= c("business","vacation","tiddlywinks")
    )
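
The question also asks for the input file to come from the command line and for the output to go to a file. A minimal sketch of one way to do that is below. It rests on a few assumptions: the helper name `spss_labels_to_r` and the file names in the usage comment are hypothetical, and the block boundaries are matched as in the sample data. It emits `factor(...)` rather than `as.factor(...)`, because in R it is `factor()` that accepts `levels` and `labels`, while `as.factor()` takes only `x`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of an SPSS syntax file and returns
# the R code as one string. Levels are kept in file order, so no sort is needed.
sub spss_labels_to_r {
    my @lines = @_;
    my ( $var, @order, %pairs );
    my $inside = 0;
    for my $line (@lines) {
        if ( !$inside ) {
            # Ignore everything before the "Value Labels" header.
            $inside = 1 if $line =~ /^\s*value\s+labels\b/i;
            next;
        }
        last if $line =~ /^\s*execute\b/i;    # end of the block
        if ( $line =~ m{^\s*/(\S+)} ) {       # "/name" starts a new variable
            $var = $1;
            push @order, $var;
        }
        elsif ( defined $var and $line =~ /^\s*(\d+)\s+"(.*)"\s*$/ ) {
            push @{ $pairs{$var} }, [ $1, $2 ];    # [ level, label ]
        }
    }
    my $r = '';
    for my $v (@order) {
        my @p = @{ $pairs{$v} || [] };
        $r .= "$v <- factor($v\n"
            . "    , levels= c(" . join( ',', map { $_->[0] } @p ) . ")\n"
            . "    , labels= c("
            . join( ',', map { '"' . $_->[1] . '"' } @p ) . ")\n"
            . "    )\n";
    }
    return $r;
}

# Usage (placeholder file names): perl spss2r.pl append.txt labels.R
# With no output path, the R code goes to standard output.
if (@ARGV) {
    my ( $in_path, $out_path ) = @ARGV;
    open my $in, '<', $in_path or die "Cannot read $in_path: $!";
    my @lines = <$in>;
    close $in;

    my $r = spss_labels_to_r(@lines);
    if ( defined $out_path ) {
        open my $out, '>', $out_path or die "Cannot write $out_path: $!";
        print {$out} $r;
        close $out or die "Cannot close $out_path: $!";
    }
    else {
        print $r;
    }
}
```

Storing each variable's levels as an ordered list of pairs keeps them in the order they appear in the SPSS file and avoids a trailing comma, since `join` only places separators between elements.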
