Perl regex syntax
I would like to use Perl to take a previously generated SPSS syntax file and format it for use in an R environment.
This is probably a very simple task for those familiar with Perl and regex, but I am stumbling.
The steps as I've laid them out for this Perl script are as follows:
- Read in SPSS file
- Find appropriate chunks of SPSS file (regex) for further processing and formatting
- Further processing noted above (more regex)
- Return R syntax to command line or preferably a file.
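For the first and last steps (reading the SPSS file and returning the result to a file), a minimal sketch using three-argument `open` with error checks might look like the following. The sub names `read_lines` and `write_text` and the file names in the usage comment are hypothetical, and the processing in the middle is left as a placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: read every line of a file, without trailing newlines.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

# Hypothetical helper: write a string to a file, replacing its contents.
sub write_text {
    my ($path, $text) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
}

# Assumed usage: perl spss2r.pl input.sps output.R
if (@ARGV == 2) {
    my ($in_path, $out_path) = @ARGV;
    my @lines = read_lines($in_path);
    # Placeholder: the lines are written back unchanged here; the
    # block extraction and R formatting would go in between.
    write_text($out_path, join("\n", @lines) . "\n");
}
```

Taking the paths from `@ARGV` avoids hard-coding a file name, and checking the return value of `open` makes a missing file fail loudly instead of silently producing no output.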
The basic format of SPSS value labels syntax is:
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
...Resume nonsense...
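One way to keep only the lines between `Value Labels` and `execute .` is a simple in/out flag while scanning the file. This is a minimal sketch under assumptions: the sub name `value_label_lines` is hypothetical, and the two patterns assume the header and the terminator each sit on their own line, as in the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines strictly between the first
# "Value Labels" line and the following "execute ." line.
sub value_label_lines {
    my @lines = @_;
    my @block;
    my $inside = 0;                      # true once the header has been seen
    for my $line (@lines) {
        if (!$inside) {
            # Assumed header pattern, matched case-insensitively.
            $inside = 1 if $line =~ /^\s*value\s+labels\b/i;
            next;                        # skip the header and everything before it
        }
        last if $line =~ /^\s*execute\s*\./i;   # assumed end-of-block marker
        push @block, $line;
    }
    return @block;
}

my @sample = (
    "...nonsense...\n", "Value Labels\n", "/gender\n",
    qq{1 "M"\n}, qq{2 "F"\n}, "execute .\n", "...more nonsense...\n",
);
print value_label_lines(@sample);
```

Only the first block is returned; a file with several `Value Labels` commands would need the loop to keep going after `execute .` instead of stopping.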
And the desired R syntax I am after looks like:
gender <- as.factor(gender
, levels= c(1,2)
, labels= c("M","F")
)
...
Here is the Perl script I have written thus far. I have successfully read each line into the appropriate array. I have the general flow of what I need for the final print function, but I need to figure out how to ONLY print the appropriate @levels and @labels arrays for each @vars array.
#!/usr/bin/perl
# Need to change to read from argument in command line
open(VARVAL, "append.txt");
@lines = <VARVAL>;
close(VARVAL);
# Read through each line and put into a variable, a value, or a reject
# I really only want to read in everything between "value labels" and "execute ."
# That probably requires more regex...
foreach (@lines){
    if ($_ =~ /\//){ # Anything with a / is a variable, remove the / and push
        $_ =~ tr/\///d;
        push(@vars, $_)
    } elsif ($_ =~/\d/) {
        push(@vals, $_) # Anything that has a number in the line is a value
    }
}
# Splitting each @vals array into levels or labels arrays
foreach (@vals){
    @values = split(/\s+/, $_); # Splitting on a space, vulnerable...better to split on first non-digit character?
    foreach (@values) {
        if ($_ =~/\d/){
            push(@levels, $_);
        } else {
            push(@labels, $_)
        }
    }
}
# Get rid of newline
# I should probably do this somewhere else?
chomp(@vars);
chomp(@levels);
chomp(@labels);
# Need to tell it when to stop adding in @levels & @labels. While loop? Hash lookup?
# Need to get rid of final comma
# Need to redirect output to a file
foreach (@vars){
    print $_ ." <- as.factor(" . $_ . "\n\t, levels = c(" ;
    foreach (@levels){
        print $_ . ",";
    }
    print ")\n\t, labels = c(";
    foreach(@labels){
        print $_ . ",";
    }
    print ")\n\t)\n";
}
And finally, here is sample output from the script as it currently runs:
gender <- as.factor(gender
, levels = c(1,2,1,2,3,)
, labels = c("M","F","biz","action","tiddlywinks",)
)
I need this to only include levels 1,2 and labels M and F.
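The output above mixes every variable's values because `@levels` and `@labels` are single flat arrays shared by all variables. A possible fix, sketched here under assumptions, is to keep one levels list and one labels list per variable and to build each `c(...)` with `join`, which also removes the trailing comma. The sub names `parse_value_labels` and `r_factor_code` are hypothetical, and the patterns assume `\w+` variable names and double-quoted labels. The sketch emits `factor(...)` rather than `as.factor(...)`, since in R `as.factor()` does not take `levels` or `labels` arguments.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the lines of a Value Labels block into an
# ordered list of { name, levels, labels } records, one per variable.
sub parse_value_labels {
    my @lines = @_;
    my @vars;                            # keeps variables in file order
    for my $line (@lines) {
        if ($line =~ m{^\s*/\s*(\w+)}) {
            # A "/name" line starts a new variable with empty lists.
            push @vars, { name => $1, levels => [], labels => [] };
        }
        elsif (@vars and $line =~ /^\s*(\d+)\s+(".*?")\s*$/) {
            # A 'number "label"' line belongs to the most recent variable.
            push @{ $vars[-1]{levels} }, $1;
            push @{ $vars[-1]{labels} }, $2;
        }
    }
    return @vars;
}

# Hypothetical helper: format one record as an R factor() call.
sub r_factor_code {
    my ($var) = @_;
    my $levels = join ',', @{ $var->{levels} };   # join leaves no trailing comma
    my $labels = join ',', @{ $var->{labels} };
    return "$var->{name} <- factor($var->{name}\n"
         . "\t, levels = c($levels)\n"
         . "\t, labels = c($labels)\n"
         . "\t)\n";
}

my @block = ("/gender\n", qq{1 "M"\n}, qq{2 "F"\n},
             "/purpose\n", qq{1 "business"\n}, qq{2 "vacation"\n}, qq{3 "tiddlywinks"\n});
print r_factor_code($_) for parse_value_labels(@block);
```

Matching the number and the quoted label with one regex, instead of splitting on whitespace, keeps labels that contain spaces in one piece.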
Thanks for the help!
Comments (1)
This seems to work for me:
And yields: