How do I extract tabular data embedded in non-tabular text files?

Posted on 2024-11-08 16:11:27

I have a few thousand reports with consistently formatted tabular data embedded in them, and I need to extract it.

I have a few ideas, but thought I'd post to see if there's a better way than what I'm thinking, which is to extract the tabular data, create a new file for it, then parse that data as a tabular file.

Here's a sample input and output, where the output is read and written row by row to a database.

INPUT_FILE

MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText
SubHeader
PASS    1283019238  alksdjalskdjl
FAIL    102310928301    kajdlkajsldkaj
PASS    102930192830    aoisdajsdoiaj
PASS    192830192301    jiasdojoasi
MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText

OUTPUT (read/write row-by-row from text-file to DB)

ROW-01{column01,column02,column03}
...
ROW-nth{column01,column02,column03}

沐歌 2024-11-15 16:11:27

Recognizing when to start processing tabular data is easy: you've got the marker line. The difficulty is recognizing when to stop. One heuristic is to stop as soon as a split doesn't yield the expected number of columns.

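use strict;
use warnings;

my $tab_data;    # true once the "SubHeader" marker line has been seen
my $num_cols;    # column count taken from the first table row

while ( <> ) {
    # The marker line switches on table processing.
    $tab_data = 1, next if $_ eq "SubHeader\n";
    next unless $tab_data;
    chomp;
    my @cols = split /\t/;    # columns are assumed to be tab-separated
    $num_cols ||= scalar @cols;
    # Stop at the first line whose column count differs from the first row.
    last if $num_cols and $num_cols != scalar @cols;
    print join( "\t", @cols ), "\n";
}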
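
If a trailing text line could happen to split into the same number of columns, a stricter stop condition may help. The following is only a sketch of one variant, not part of the script above: it matches each row against a pattern instead of counting columns. The pattern (a PASS/FAIL status, a numeric id, one free-form token, separated by runs of whitespace) is an assumption inferred from the sample input, and `extract_rows` is a hypothetical helper name. Unlike the loop above, it keeps scanning after a table ends, on the further assumption that a file may contain more than one SubHeader block.

```perl
use strict;
use warnings;

# Assumed row shape, inferred from the sample: status, numeric id, token.
my $ROW_RE = qr/^(PASS|FAIL)\s+(\d+)\s+(\S+)\s*$/;

# Hypothetical helper: returns an array ref of [status, id, token] rows
# found after each "SubHeader" line read from the given filehandle.
sub extract_rows {
    my ($fh) = @_;
    my @rows;
    my $in_table = 0;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;    # tolerate both LF and CRLF line endings
        if ( $line eq 'SubHeader' ) {
            $in_table = 1;       # a marker line opens a (new) table
            next;
        }
        next unless $in_table;
        if ( my @cols = $line =~ $ROW_RE ) {
            push @rows, [@cols];
        }
        else {
            $in_table = 0;       # first non-matching line closes the table
        }
    }
    return \@rows;
}

# Demonstration on a shortened copy of the sample input, read from an
# in-memory filehandle so the sketch runs without any external file.
my $sample = <<'END';
MiscText MiscText MiscText
SubHeader
PASS    1283019238  alksdjalskdjl
FAIL    102310928301    kajdlkajsldkaj
MiscText MiscText MiscText
END

open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
print join( "\t", @$_ ), "\n" for @{ extract_rows($fh) };
```

Each returned row is an array ref, so it could be handed to a database insert one row at a time. The shorter script above is still the one to save and run for the common case.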

Save as etd.pl (etd = extract tabular data, what did you think?), and call it like this from the command line: