What is the most efficient way to parse a text file with Perl?
Although this is pretty basic, I can't find a similar question, so please link to one if you know of an existing question/solution on SO.

I have a .txt file that is about 2MB and about 16,000 lines long. Each record is 160 characters long with a blocking factor of 10. This is an older type of data structure which almost looks like a tab-delimited file, but the separation is by single characters/whitespace.

First, I glob a directory for .txt files - there is never more than one file in the directory at a time, so this step may be inefficient in itself.

my $txt_file = glob "/some/cheese/dir/*.txt";

Then I open the file with this line:

open (F, $txt_file) || die ("Could not open $txt_file");

As per the data dictionary for this file, I'm parsing each "field" out of each line using Perl's substr() function within a while loop.
while ($line = <F>)
{
$nom_stat = substr($line,0,1);
$lname = substr($line,1,15);
$fname = substr($line,16,15);
$mname = substr($line,31,1);
$address = substr($line,32,30);
$city = substr($line,62,20);
$st = substr($line,82,2);
$zip = substr($line,84,5);
$lnum = substr($line,93,9);
$cl_rank = substr($line,108,4);
$ceeb = substr($line,112,6);
$county = substr($line,118,2);
$sex = substr($line,120,1);
$grant_type = substr($line,121,1);
$int_major = substr($line,122,3);
$acad_idx = substr($line,125,3);
$gpa = substr($line,128,5);
$hs_cl_size = substr($line,135,4);
}
This approach takes a lot of time to process each line and I'm wondering if there is a more efficient way of getting each field out of each line of the file.
Can anyone suggest a more efficient/preferred method?
Answers (4)
It looks to me that you are working with fixed width fields here. Is that true? If it is, the unpack function is what you need. You provide the template for the fields and it will extract the info from those fields. There is a tutorial available, and the template information is found in the documentation for pack, which is unpack's logical inverse. In a basic template, 'A' means any text character (as I understand it) and the number says how many. There is quite an art to unpack as some people use it, but I believe this will suffice for basic use.
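The answer's inline example was not preserved in this copy. A minimal sketch of what an unpack template for this record layout might look like, with the field widths and gap sizes reconstructed from the substr() offsets in the question:

```perl
use strict;
use warnings;

# Widths reconstructed from the question's substr() calls; 'A' trims
# trailing spaces from each field, 'x' skips the unused gaps between fields.
my $template = 'A1 A15 A15 A1 A30 A20 A2 A5 x4 A9 x6 A4 A6 A2 A1 A1 A3 A3 A5 x2 A4';

my $line = '0123456789' x 16;   # stand-in for one 160-character record

my ($nom_stat, $lname, $fname, $mname, $address, $city, $st, $zip,
    $lnum, $cl_rank, $ceeb, $county, $sex, $grant_type,
    $int_major, $acad_idx, $gpa, $hs_cl_size) = unpack $template, $line;

print "$lname / $city / $st\n";
```

Reading a whole record's fields in one unpack call avoids repeated substr bookkeeping and keeps the layout in a single template string.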
A single regular expression, compiled and cached using the /o option, is the fastest approach. I ran your code three ways using the Benchmark module.

Input was a file with 20k lines; each line had the same 160 characters on it (16 repetitions of the characters 0123456789), so it's the same input size as the data you're working with.

The Benchmark::cmpthese() method outputs the subroutine calls from slowest to fastest, and the first column tells us how many times per second each subroutine can be run. The regular expression approach is the fastest - not unpack, as I stated previously. Sorry about that.

The benchmark code is below. The print statements are there as sanity checks. This was with Perl 5.10.0 built for darwin-thread-multi-2level.
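The original benchmark code did not survive in this copy. A sketch of how such a three-way comparison could be set up with Benchmark's cmpthese, extracting two representative fields each way (the field offsets here are taken from the question's substr() calls; the exact subroutines the answer benchmarked are an assumption):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample fixed-width line: 16 repetitions of '0123456789', as in the answer.
my $line = '0123456789' x 16;

sub with_substr {
    my $lname = substr($line, 1, 15);
    my $city  = substr($line, 62, 20);
    return ($lname, $city);
}

sub with_unpack {
    # 'x1 A15' skips one byte, takes 15; 'x46 A20' skips to offset 62.
    my ($lname, $city) = unpack 'x1 A15 x46 A20', $line;
    return ($lname, $city);
}

sub with_regex {
    # Anchored pattern capturing by width; /o asks Perl to compile it once.
    my ($lname, $city) = $line =~ /^.(.{15}).{46}(.{20})/o;
    return ($lname, $city);
}

cmpthese(200_000, {
    substr => \&with_substr,
    unpack => \&with_unpack,
    regex  => \&with_regex,
});
```

cmpthese prints a table of iterations per second and pairwise percentage differences, so the relative cost of the three approaches can be read directly from its output.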
Do a split on each line, like this:
and then work with your values.
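The inline code was not preserved here. Presumably it was along these lines - a sketch assuming the single-character separators are whitespace; note that this only works if no field (such as the 30-character address) contains embedded spaces, which the fixed-width layout does not guarantee:

```perl
use strict;
use warnings;

# Hypothetical record whose fields happen to contain no embedded spaces;
# with real fixed-width data, substr or unpack is the safer choice.
my $line = 'N Smith Jane Q 62704';

my @fields = split /\s+/, $line;
my ($nom_stat, $lname, $fname, $mname, $zip) = @fields;
print "$lname, $fname ($zip)\n";
```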
You could do something like:
I think it's faster than your substr approach; I'm not sure it's the fastest solution, but it might very well be.
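The code this answer refers to was not preserved in this copy. One common alternative to repeated substr calls is a single anchored regex capturing every field at once; the sketch below reconstructs the capture widths from the question's substr() offsets, using /x so the pattern can be split across lines:

```perl
use strict;
use warnings;

# Capture widths mirror the question's substr() offsets; bare .{n} atoms
# without parentheses skip the unused gaps between fields.
my $record_re = qr/
    ^(.{1})(.{15})(.{15})(.{1})(.{30})(.{20})(.{2})(.{5})
    .{4}(.{9}).{6}(.{4})(.{6})(.{2})(.{1})(.{1})(.{3})(.{3})(.{5}).{2}(.{4})
/x;

my $line = '0123456789' x 16;   # stand-in for one 160-character record

my ($nom_stat, $lname, $fname, $mname, $address, $city, $st, $zip,
    $lnum, $cl_rank, $ceeb, $county, $sex, $grant_type,
    $int_major, $acad_idx, $gpa, $hs_cl_size) = $line =~ $record_re;

print "$st $zip\n";
```

A single match fills all eighteen variables in one pass, which is the shape of solution the benchmarking answer above found fastest.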