Processing text from a non-flat file (to extract information as if it were a flat file)

Posted 2024-08-21 01:48:52

I have a longitudinal data set generated by a computer simulation that can be represented by the following tables ('var' are variables):

time subject var1 var2 var3
t1   subjectA  ...
t2   subjectB  ...

and

subject   name
subjectA  nameA
subjectB  nameB

However, the simulation writes its data file in a format similar to the following:

time t1 
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
time t2
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
...(and so on)

I have been using a Python script to process this output into a flat text file so that I can import it into R, Python, or SQL, or run awk/grep on it to extract information. An example of the type of information desired from a single query (in SQL notation, after the data is converted to a table) is shown below:

SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'

I wonder if there is a more efficient solution: each of these data files can be ~100MB (and I have hundreds of them), and creating the flat text file is time-consuming and takes up additional hard drive space with redundant information. Ideally, I would interact with the original data set directly to extract the information that I want, without creating the extra flat text file. Is there a simpler awk/perl solution for such tasks? I'm quite proficient at text processing in Python, but my awk skills are rudimentary and I have no working knowledge of Perl; I wonder if these or other domain-specific tools can provide a better solution.

Thanks!

Postscript:
Wow, thanks to all! I am sorry that I cannot accept everyone's answer.
@FM: thanks. My Python script resembles your code without the filtering step. But your organization is clean.
@PP: I thought I was already proficient in grep but apparently not! This is very helpful... but I think grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example! That's my bad).
@ghostdog74: This is just fantastic... but modifying the line to get 'subjectA' was not straightforward... (though I'll be reading up more on awk in the meantime and hopefully I'll grok later).
@weismat: Well stated.
@S.Lott: This is extremely elegant and flexible - I was not asking for a python(ic) solution but this fits in cleanly with the parse, filter, and output framework suggested by PP, and is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.

Again, I am grateful to everyone - thanks so much.

Comments (5)

超可爱的懒熊 2024-08-28 01:48:52

This is what Python generators are all about.

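def read_as_flat(some_file, subject_names):
    """Yield one flat row per subject: [time, description, var1, var2, var3]."""
    line_iter = iter(some_file)
    time_header = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if words[0] == 'time':
            # the "time" line is followed by its description line
            description = next(line_iter).strip()
            time_header = [words[1], description]
        elif words[0] in subject_names:
            # the line after a subject line holds the variables
            data = next(line_iter).split()
            yield time_header + data

You can use this like a standard Python iterator, where subject_names is a set of the subject identifiers you want to keep:

for time, description, var1, var2, var3 in read_as_flat(some_file, subject_names):
    ...  # process one flat row at a time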
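
Because a generator produces rows lazily, the query from the question can be answered without ever writing a flat file. The following is a minimal, self-contained sketch of that idea, not the asker's actual script: the sample data, the `flat_rows` name, and the numeric values are all made up for illustration, and it assumes every subject line is followed by exactly one variable line.

```python
import io

# Hypothetical sample in the layout described in the question.
SAMPLE = """\
time t1
  first description
subjectA nameA
  1 2 3
subjectB nameB
  4 5 6
time t2
  second description
subjectA nameA
  7 8 9
subjectB nameB
  10 11 12
"""

def flat_rows(lines, subjects):
    """Yield (time, subject, var1, var2, var3) tuples, one per subject block."""
    line_iter = iter(lines)
    time = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # ignore blank lines
        if words[0] == "time":
            time = words[1]
            next(line_iter, None)  # skip the description line
        elif words[0] in subjects:
            values = next(line_iter).split()
            yield (time, words[0], *values)

# Rough equivalent of:
#   SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'
rows = flat_rows(io.StringIO(SAMPLE), {"subjectA", "subjectB"})
subject_b = [row[2:] for row in rows if row[1] == "subjectB"]
print(subject_b)
```

Only one row is held in memory at a time until the final list is built, so the same pattern should scale to files of the size mentioned in the question.
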
亚希 2024-08-28 01:48:52

If all you want is var1, var2, var3 upon matching a particular subject then you could try the following command:


  grep -A 1 'subjectB'

The -A 1 command line argument instructs grep to print out the matched line and one line after the matched line (and in this case the variables come on a line after the subject).

You might also want to anchor the subject search to the beginning of the line (e.g. grep -A 1 '^subjectB '). grep treats the pattern as a regular expression by default; the -E option is only needed if you want extended regular expression syntax.

The output now consists of the subject line and the variable line for each match (GNU grep also prints a -- separator line between non-adjacent groups of matches). You may want to hide the subject line:


  grep -A 1 'subjectB' |grep -v 'subjectB'

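And you may wish to process the variable line, for example stripping the leading indentation and turning the remaining spaces into commas:


  grep -A 1 'subjectB' |grep -v 'subjectB' |perl -pe 's/^\s+//; s/ +/,/g'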

山色无中 2024-08-28 01:48:52

The best option would be to modify the computer simulation to produce rectangular output. Assuming you can't do that, here's one approach:

In order to be able to use the data in R, SQL, etc. you need to convert it from hierarchical to rectangular one way or another. If you already have a parser that can convert the entire file into a rectangular data set, you are most of the way there. The next step is to add additional flexibility to your parser, so that it can filter out unwanted data records. Instead of having a file converter, you'll have a data extraction utility.

The example below is in Perl, but you can do the same thing in Python. The general idea is to maintain a clean separation between (a) parsing, (b) filtering, and (c) output. That way, you have a flexible environment, making it easy to add different filtering or output methods, depending on your immediate data-crunching needs. You can also set up the filtering methods to accept parameters (either from command line or a config file) for greater flexibility.

use strict;
use warnings;

read_file($ARGV[0], \&check_record);

sub read_file {
    my ($file_name, $check_record) = @_;
    open(my $file_handle, '<', $file_name) or die $!;
    # A data structure to hold an entire record.
    my $rec = {
        time => '',
        desc => '',
        subj => '',
        name => '',
        vars => [],
    };
    # A code reference to get the next line and do some cleanup.
    my $get_line = sub {
        my $line = <$file_handle>;
        return unless defined $line;
        chomp $line;
        $line =~ s/^\s+//;
        return $line;
    };
    # Start parsing the data file.
    while ( defined( my $line = $get_line->() ) ){
        next unless length $line;    # skip blank lines instead of stopping at them
        if ($line =~ /^time (\w+)/){
            $rec->{time} = $1;
            $rec->{desc} = $get_line->();
        }
        else {
            ($rec->{subj}, $rec->{name}) = $line =~ /(\w+) +(\w+)/;
            $rec->{vars} = [ split / +/, $get_line->() ];

            # OK, we have a complete record. Now invoke our filtering
            # code to decide whether to export record to rectangular format.
            $check_record->($rec);
        }
    }
}

sub check_record {
    my $rec = shift;
    # Just an illustration. You'll want to parameterize this, most likely.
    write_output($rec)
        if  $rec->{subj} eq 'subjectB'
        and $rec->{time} eq 't1'
    ;
}

sub write_output {
    my $rec = shift;
    print join("\t", 
        $rec->{time}, $rec->{subj}, $rec->{name},
        @{$rec->{vars}},
    ), "\n";
}
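
For readers who would rather stay in Python, the same separation between parsing, filtering, and output might look roughly like the sketch below. It is an illustration under assumptions rather than a port of the Perl above: the names `parse_records`, `make_filter`, and `write_output` are hypothetical, the sample data is invented, and it assumes each subject line has at least two words and is followed by one variable line.

```python
import csv
import io
import sys

# Hypothetical sample in the layout described in the question.
SAMPLE = """\
time t1
  first description
subjectA nameA
  1 2 3
subjectB nameB
  4 5 6
time t2
  second description
subjectB nameB
  10 11 12
"""

def parse_records(lines):
    """(a) Parse: yield one dict per subject block."""
    line_iter = (line.strip() for line in lines)
    time = desc = None
    for line in line_iter:
        if not line:
            continue  # ignore blank lines
        if line.startswith("time "):
            time = line.split()[1]
            desc = next(line_iter, "")
        else:
            subj, name = line.split()[:2]
            yield {"time": time, "desc": desc, "subj": subj, "name": name,
                   "vars": next(line_iter, "").split()}

def make_filter(**wanted):
    """(b) Filter: build a predicate that keeps records matching every field."""
    return lambda rec: all(rec[key] == value for key, value in wanted.items())

def write_output(records, out=sys.stdout):
    """(c) Output: write the kept records as tab-separated rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for rec in records:
        writer.writerow([rec["time"], rec["subj"], rec["name"], *rec["vars"]])

keep = make_filter(subj="subjectB", time="t1")
buffer = io.StringIO()
write_output(filter(keep, parse_records(io.StringIO(SAMPLE))), buffer)
print(buffer.getvalue(), end="")
```

The keyword arguments to `make_filter` are where command-line or config-file parameters could be plugged in, which keeps the parser and the writer unchanged when the query changes.
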
苦笑流年记忆 2024-08-28 01:48:52

If you are lazy and have enough RAM, I would write the flat files to a RAM disk instead of the regular file system, as long as you only need them right away.
I do not think that Perl or awk will be faster than Python if you are just recoding your current algorithm into a different language.
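
A related way to keep the rectangular data in memory, sketched here as an alternative rather than as the RAM-disk setup itself, is to load the parsed rows into an in-memory SQLite database using Python's standard `sqlite3` module and run the question's query there. The sample data, the `flat_rows` helper, and the column types are assumptions made for illustration; the table name `datatable` comes from the question.

```python
import io
import sqlite3

# Hypothetical sample in the layout described in the question.
SAMPLE = """\
time t1
  first description
subjectA nameA
  1 2 3
subjectB nameB
  4 5 6
time t2
  second description
subjectA nameA
  7 8 9
subjectB nameB
  10 11 12
"""

def flat_rows(lines):
    """Yield (time, subject, var1, var2, var3) for every subject block."""
    line_iter = iter(lines)
    time = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # ignore blank lines
        if words[0] == "time":
            time = words[1]
            next(line_iter, None)  # skip the description line
        else:
            var1, var2, var3 = next(line_iter).split()
            yield (time, words[0], var1, var2, var3)

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datatable "
    "(time TEXT, subject TEXT, var1 REAL, var2 REAL, var3 REAL)"
)
conn.executemany(
    "INSERT INTO datatable VALUES (?, ?, ?, ?, ?)",
    flat_rows(io.StringIO(SAMPLE)),
)
result = conn.execute(
    "SELECT var1, var2, var3 FROM datatable "
    "WHERE subject = 'subjectB' ORDER BY rowid"
).fetchall()
print(result)
conn.close()
```

Whether this is faster than the existing script depends on how many queries are run per file; parsing still dominates for a single query, which is consistent with the point above about recoding the same algorithm.
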

苏别ゝ 2024-08-28 01:48:52
awk '/time/{f=0}/subjectB/{f=1;next}f' file
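
For readers less familiar with awk: the one-liner keeps a flag f. A line matching time clears the flag, a line matching subjectB sets it and is itself skipped, and every other line is printed while the flag is set. Swapping in subjectA therefore also prints the subjectB block that follows it, because nothing clears the flag until the next time line. The Python sketch below mirrors the flag logic with one deliberate change, which is an assumption about the file layout rather than part of the awk command: the flag is re-evaluated on every non-indented line, so any subject can be requested. The function name and sample data are hypothetical.

```python
# Hypothetical sample in the layout described in the question.
SAMPLE = """\
time t1
  first description
subjectA nameA
  1 2 3
subjectB nameB
  4 5 6
time t2
  second description
subjectA nameA
  7 8 9
subjectB nameB
  10 11 12
"""

def lines_after_subject(lines, subject):
    """Yield the indented lines that follow each header line for `subject`."""
    printing = False
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line[:1].isspace():
            # Indented line (description or variables): emit only while the flag is set.
            if printing:
                yield line.strip()
        else:
            # Header line ("time ..." or "<subject> <name>"): set or clear the flag.
            printing = line.split()[0] == subject

for row in lines_after_subject(SAMPLE.splitlines(), "subjectA"):
    print(row)
```
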