Is there a simple way to split a text file into brace-balanced sections?

Posted 2024-07-23 13:22:29

I'm trying to parse some data out of a file using Perl and Parse::RecDescent. I can't throw the full data file at the Perl script because RecDescent will take days poring over it. So I split the huge data file into RD-sized chunks to reduce the runtime.

However, I need to extract sections within balanced brackets, and the routine I have now is not robust (it depends too much on where the final closing bracket sits relative to a newline). Example:

cell ( identifier ) {
  keyword2 { };
  ...
  keyword3 { keyword4 {  } };
}

...more sections...

I need to grab everything from cell ... { to the matching closing } which can have various amounts of spacing and sub-sections.

There must be some Linux command-line tool that does this easily? Any ideas?

Edit: Input files are around 8M, grammar ~60 rules.

鞋纸虽美，但不合脚ㄋ〞 2024-07-30 13:22:30

Why does RecDescent take so long? Is it because your grammar is complex? If so, you could do a two-level pass using Parse::RecDescent. The idea is to define a simple grammar that parses only cell ... { ... }, then pass each section it extracts to a second Parse::RecDescent parser built from your more complex grammar. This is only a guess at why RecDescent is slow on your data.
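As a rough sketch of that two-level idea, the first level below is a recursive regex rather than a Parse::RecDescent grammar (a deliberate substitution to keep the example self-contained; it needs Perl 5.10 or later). The names extract_cells, parse_in_chunks and $parse_cell are hypothetical, and the header pattern assumes every section starts with cell ( ... ) with no nested parentheses, as in the question's example.

```perl
use strict;
use warnings;

# First level: a recursive regex stands in for the "simple grammar".
# (?&body) re-enters the named group, so nested braces stay balanced.
my $cell_re = qr/
    ( \b cell \s* \( [^)]* \) \s*
      (?<body> \{ (?: [^{}]++ | (?&body) )*+ \} )
    )
/x;

# Hypothetical helper: return each complete cell section found in $text.
sub extract_cells {
    my ($text) = @_;
    my @cells;
    push @cells, $1 while $text =~ /$cell_re/g;
    return @cells;
}

# Second level: hand each chunk to the expensive parser. $parse_cell is a
# placeholder code ref; with Parse::RecDescent it might wrap something like
# $parser->cell($chunk), assuming the full grammar has a rule named "cell".
sub parse_in_chunks {
    my ( $text, $parse_cell ) = @_;
    return map { $parse_cell->($_) } extract_cells($text);
}
```

Since the input is only about 8M, reading the whole file into $text before calling parse_in_chunks should be affordable. Anything between sections is simply skipped by the first level.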

Another option is to write your own simple parser that matches on the cell entries, counts the number of braces it's seen so far, and then finds the matching brace when the closing brace count is equal to the opening brace count. That should be fast, but the suggestion above might be faster to implement and easier to maintain.

Edit: You should definitely try Parse::RecDescent with a simplified grammar. The algorithmic complexity of recursive descent parsing is proportional to the number of possible parse trees, which should be something like B ^ N, where B is the number of branching points in your grammar and N is the number of nodes.
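Taking that rough B ^ N estimate at face value (it is a guess, not a measured figure), a back-of-the-envelope comparison suggests why chunking should help. The split into k independent sections of roughly N/k nodes each is an assumption made for illustration:

```latex
\underbrace{k \cdot B^{N/k}}_{\text{parse } k \text{ chunks separately}}
\;\ll\;
\underbrace{B^{N}}_{\text{parse everything at once}}
\qquad\text{e.g. } B = 2,\; N = 40,\; k = 4:\quad
4 \cdot 2^{10} = 4096
\;\text{ versus }\;
2^{40} \approx 1.1 \times 10^{12}
```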

If you'd like to try rolling your own simple parser for a first pass over your input, the following code can get you started.

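#!/usr/bin/perl

use strict;
use warnings;

my $input_file = "input";
open my $fh, '<', $input_file or die "Cannot open $input_file: $!";

my $in_block = 0;            # true while inside a cell section
my $seen_brace = 0;          # true once the section's first '{' has been read
my $current_block = '';
my $open_bracket_count = 0;
while ( my $line = <$fh> ) {
    # A line containing the "cell" keyword opens a new section.
    if ( !$in_block && $line =~ /\bcell\b/ ) {
        $in_block = 1;
    }

    if ( $in_block ) {
        # Count every brace on the line to track the nesting depth.
        while ( $line =~ /([{}])/g ) {
            if ( $1 eq '{' ) {
                $open_bracket_count++;
                $seen_brace = 1;
            } else {
                $open_bracket_count--;
            }
        }

        $current_block .= $line;
    }

    # The section is complete once its braces balance. Requiring $seen_brace
    # keeps a header whose '{' is on a later line from ending the block early.
    if ( $in_block && $seen_brace && $open_bracket_count == 0 ) {
        print '-' x 80, "\n";
        print $current_block, "\n";
        $in_block = 0;
        $seen_brace = 0;
        $current_block = '';
    }
}
close $fh or die "Cannot close $input_file: $!";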

Edit: changed code to avoid slurping the entire file into memory. While this is trivial for an 8MB file, it's cleaner to just read the file in line-by-line.
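Because the loop above appends whole lines, it still ties section boundaries to newlines, which is the weakness the question describes. If slurping the file is acceptable, a possible variant is to apply the same brace counting to the text as a single string and cut at the exact offset of the matching close brace. This is only a sketch: split_balanced is a hypothetical name, and it assumes braces never appear inside strings or comments in the data.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the text brace by brace and cut each section
# exactly at its matching close brace, wherever the line breaks fall.
sub split_balanced {
    my ($text) = @_;
    my @sections;
    # Each outer iteration finds the next "cell" keyword.
    while ( $text =~ /\bcell\b/g ) {
        my $start = $-[0];    # offset where the keyword begins
        my $depth = 0;
        # /c keeps pos() on a failed match, so an unterminated section
        # cannot send the outer loop back to the start of the text.
        while ( $text =~ /([{}])/gc ) {
            $depth += ( $1 eq '{' ) ? 1 : -1;
            if ( $depth == 0 ) {
                push @sections, substr( $text, $start, pos($text) - $start );
                last;
            }
            last if $depth < 0;    # stray close brace: skip this keyword
        }
    }
    return @sections;
}
```

A section that never closes, such as a truncated final cell, is dropped rather than returned.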
