Parsing an irregular text file in Perl

Posted on 2024-11-08 20:55:13

I am new to Perl programming and would like to know how to parse text files with Perl.
I have a text file with irregular formatting, and I would like to parse each entry into three parts.

Basically, the file contains text similar to this:

;out;asoljefsaiouerfas'pozsirt'z
mysql_query("SELECT * FROM Table WHERE (value='true') OR (value2='true') OR (value3='true') ");
1234 434 3454

4if[9put[e]9sd=09q]024s-q]3-=04i
select ta.somefield, tc.somefield 
from TableA ta INNER JOIN TableC tc on tc.somefield=ta.somefield 
INNER JOIN TableB tb on tb.somefield=ta.somefield 
ORDER by tb.somefield
234 4536 234

The list goes on in this format.

So what I need to do is parse each entry into three parts: the line on top, which holds the hash check; the MySQL query in the middle; and the three numbers at the bottom. For some reason I cannot work out how to do this. I use Perl's 'open' function to read the data from the text file, and then I tried the 'split' function on line breaks, but the queries are not always on a single line and do not follow a fixed pattern, so splitting by line does not work the way I expected.

べ繥欢鉨o。 2024-11-15 20:55:13

Assumptions:

There will be a blank line between chunks of data.
That blank line will consist of only a newline.
In these chunks the hash checks will be the top single line, and the three numbers will be the bottom single line.

With that in mind:

use strict;
use warnings;
use English qw<$RS $OS_ERROR>;

local $RS = "\n\n";

open( my $fh, '<', $path_to_file ) 
    or die "Could not open $path_to_file! - $OS_ERROR"
    ;
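while ( <$fh> ) {                     # read from $fh, one chunk per iteration
    chomp;                            # strips the trailing "\n\n" separator
    my ( $hash_check_line
       , @inner_lines
       )
       = split /\n/
       ;
    my @numbers = split /\D+/, pop @inner_lines;   # last line holds the numbers
    my $sql     = join( "\n", @inner_lines );      # what remains is the query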

    ...
}

By setting $RS (also known as $/ or $INPUT_RECORD_SEPARATOR) to a double newline, we make each read return a whole blank-line-separated chunk instead of a single line.

This is not so bizarre. In my years with Perl I have had to set the record separator to some pretty interesting strings, and sometimes that is all it takes to read in just the chunk you want.
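One caveat with "\n\n" as the separator: if two chunks are separated by three or more newlines, the extra newlines end up at the start of the next chunk. Perl's paragraph mode ($/ = '') treats any run of blank lines as a single separator. The following is a minimal sketch of that variant, not the answer's original code; the sample data, the in-memory filehandle, and the helper name read_chunks are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data; the two records are separated by three newlines.
my $sample = join "\n",
    'checksum-one',
    'SELECT 1',
    '1 2 3',
    '',
    '',
    'checksum-two',
    'SELECT a',
    'FROM t',
    '4 5 6',
    '';

# read_chunks is a hypothetical helper: it parses the text into records.
sub read_chunks {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die "Could not open in-memory file: $!";
    local $/ = '';    # paragraph mode: one or more blank lines end a record
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;    # in paragraph mode, chomp strips all trailing newlines
        my ( $hash_check_line, @inner_lines ) = split /\n/, $chunk;
        my @numbers = split ' ', pop @inner_lines;    # last line: the numbers
        push @records, {
            check   => $hash_check_line,
            sql     => join( "\n", @inner_lines ),
            numbers => \@numbers,
        };
    }
    close $fh;
    return @records;
}

my @records = read_chunks($sample);
printf "%s | %s | %s\n", $_->{check}, $_->{sql}, "@{ $_->{numbers} }"
    for @records;
```

With a real file, the open call would take the file path instead of a reference to a string; the rest stays the same.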

寻找一个思念的角度 2024-11-15 20:55:13

Oh, oh GOD.

The algorithm I see is:

Cache the first line.
Read all the lines until a blank line.
The last line will be the numbers.
All the rest will be the query.

With that in mind, I present the following code:

open my $fh, '<', $path_to_file
    or die "Can't open $path_to_file: $!";
while (my ($checksum, $query, $numbers) = read_record($fh) ) {
    # do something with record
}
close $fh or warn "$!";

sub read_record {
    my $fh = shift;
    my @lines;
    LINE: while (my $line = <$fh>) {
        chomp $line;
        last LINE if $line eq q{}; # if empty, we're done with the record!
        push @lines, $line;        # store it :)
    }
    return unless @lines;          # if we didn't get anything, eof!
    my $checksum = shift @lines;   # first was checksum.
    my $numbers = pop @lines;      # last thing read was numbers.
    my $query = join ' ', @lines;  # everything else, query.
    return ($checksum, $query, $numbers);
}

Modify, of course, to suit boundary conditions.
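As one possible reading of "boundary conditions", here is a hedged sketch of read_record that skips leading and repeated blank lines, treats whitespace-only lines as blank, tolerates CRLF line endings, and rejects records with fewer than three lines. The sample data and the in-memory filehandle are assumptions for illustration, and the choice to die on a short record is a design decision that may not suit every file.

```perl
use strict;
use warnings;

sub read_record {
    my $fh = shift;
    my @lines;
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;          # strip newline, CR, and trailing spaces
        if ( $line eq q{} ) {
            next unless @lines;      # blank line before a record: skip it
            last;                    # blank line after a record: done
        }
        push @lines, $line;
    }
    return unless @lines;            # nothing left to read: end of file
    die "Malformed record starting with '$lines[0]'\n" if @lines < 3;
    my $checksum = shift @lines;     # first line is the checksum
    my $numbers  = pop @lines;       # last line is the numbers
    my $query    = join ' ', @lines; # everything in between is the query
    return ( $checksum, $query, $numbers );
}

# Hypothetical sample: leading blank lines and a whitespace-only separator.
my $sample = "\n\nabc123\nSELECT a\nFROM t\n1 2 3\n\n   \n\n"
           . "def456\nSELECT b\n4 5 6\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my @records;
while ( my @record = read_record($fh) ) {
    push @records, [@record];        # store a copy of each record
}
close $fh;
print join( ' | ', @$_ ), "\n" for @records;
```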

赢得她心 2024-11-15 20:55:13

The following seems to work:

while ($file_content =~ /\s*^(.+?)^(.*?)^(\d+\s+\d+\s+\d+)$/smg) {
    my $checksum = $1;
    my $query = $2;
    my $numbers = $3;
    # do stuff
}

Here is an explanation for the regex:

\s*                   # eat up empty lines
^(.+?)                # save the checksum line to group 1
^(.*?)                # save the query lines (possibly several) to group 2
^(\d+\s+\d+\s+\d+)$   # save number line to group 3

The first group will always be exactly one line: because it is lazy, the regex tries to move on to the second group as soon as the next line begins. From that point, if the rest of the match can be completed, the second group contains all the lines before the numbers line. Note that groups 1 and 2 each include their trailing newline.
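To try the pattern end to end, here is a small sketch that applies it to a sample string, using the /x flag so the explanation can sit inside the regex. The heredoc is a shortened, hypothetical stand-in for the real file; with an actual filehandle, the content could be slurped with something like `do { local $/; <$fh> }`.

```perl
use strict;
use warnings;

# Hypothetical sample content standing in for the slurped file.
my $file_content = <<'END';
;out;asoljefsaiouerfas'pozsirt'z
mysql_query("SELECT * FROM Table WHERE (value='true')");
1234 434 3454

4if[9put[e]9sd=09q]024s-q]3-=04i
select ta.somefield
from TableA ta
ORDER by ta.somefield
234 4536 234
END

my @records;
while (
    $file_content =~ /
        \s*                     # eat up empty lines
        ^(.+?)                  # group 1: checksum line (with its newline)
        ^(.*?)                  # group 2: query lines (with their newlines)
        ^(\d+\s+\d+\s+\d+)$     # group 3: the line of three numbers
    /smgx
) {
    my ( $checksum, $query, $numbers ) = ( $1, $2, $3 );
    chomp $checksum;            # drop the newline captured by group 1
    chomp $query;               # drop the final newline captured by group 2
    push @records, [ $checksum, $query, [ split ' ', $numbers ] ];
}

print scalar(@records), " records parsed\n";
```

This approach relies on the numbers line being the first line in each record that consists solely of three whitespace-separated integers; a query line that happens to look like that would end the record early.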
