Perl transformation logic - file processing or DB

Posted on 2024-10-17 21:55:57

I am building the transformation logic for a file. It applies certain transformation rules to the fields in the file. Examples of such rules are:

  • Setting default values for certain fields if they are empty (if col 5 is empty, set it to "Empty")
  • Summarizing the file based on certain columns (if the file has col1, col2 and col3, aggregate col3 for each value of col1)
  • Substituting strings in certain fields (replace all "ax" in col1 with "ay")
  • Etc.

From a performance perspective, when applying these transformations to a large file, is it better to use plain file processing (read the file line by line, use hashes for summarizing, regexes for the other transformations, etc.), or to load the data into a database table, summarize and apply all the transformation logic there, and then export the result back to a file?

Example of summarization:

Original file has:

A|B|C|100|200|300
A|B|C|200|100|0
A|X|C|100|100|100

Transformed file has:

A|B|300|300|300
A|X|100|100|100

Comments (2)

安人多梦 2024-10-24 21:55:57

Assuming the data you have given, this problem is well within Perl's grasp without a database:

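use strict;
use warnings;

my %data;
while (my $line = <DATA>) {
    chomp $line;
    # Keep col1 and col2 as the grouping key, drop col3, and collect the
    # numeric columns; the -1 limit preserves trailing empty fields.
    my ($c1, $c2, undef, @cols) = split /\|/, $line, -1;

    # Add each numeric column to the running totals for this key.
    $data{"$c1|$c2"}[$_] += $cols[$_] for 0 .. $#cols;
}

# One output row per key: the key followed by its column totals.
print join('|' => $_, @{ $data{$_} }), "\n" for sort keys %data;

__DATA__
A|B|C|100|200|300
A|B|C|200|100|0
A|X|C|100|100|100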

which prints:

A|B|300|300|300
A|X|100|100|100

You will of course need to code the remaining transforms, but this should give you a start. Even if it turns out that you need to access the raw rows more than once, you could load the data into a two-dimensional array, assuming it is not gigantic, and run your passes over it. Or you could use Tie::File to access a very large file without reading it all into memory.
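
As a rough illustration of the remaining per-row transforms, here is a minimal sketch of the two other rules from the question (default value for an empty col 5, "ax" to "ay" in col 1). The helper name transform_line is made up for this example, and the column positions are assumptions taken from the question's wording.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the per-row rules from the question to one
# pipe-delimited line and returns the rewritten line.
sub transform_line {
    my ($line) = @_;
    chomp $line;

    # The -1 limit keeps trailing empty fields, so an empty last column survives.
    my @f = split /\|/, $line, -1;

    # Rule: if col 5 exists and is empty, set it to "Empty".
    $f[4] = 'Empty' if @f >= 5 && $f[4] eq '';

    # Rule: replace every "ax" in col 1 with "ay".
    $f[0] =~ s/ax/ay/g if @f;

    return join '|', @f;
}

# The first row gets both rules applied; the second passes through unchanged.
print transform_line($_), "\n" for 'axax|B|C|100||300', 'A|B|C|100|200|300';
```

Because each rule only looks at the current row, a helper like this can run inside the same while loop as the summary, before the split fields are added to the hash, so the file is still read only once.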

酒绊 2024-10-24 21:55:57

The best solution would be to code the system both ways and take measurements to decide which is better.
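
One way to take such measurements is with the core Benchmark module. The sketch below times only the hash-based summary from the first answer, on synthetic rows that are assumed to resemble the sample data. The names summarize_in_perl and %candidates are invented for this example, and a database-backed variant is left as a slot to fill in rather than shown.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Synthetic input, assumed to resemble the sample rows; a real test should
# use the actual file.
my @lines = map { join '|', 'A', 'K' . ($_ % 1000), 'C', 100, 200, 300 } 1 .. 100_000;

# Candidate 1: the hash-based summary from the first answer.
sub summarize_in_perl {
    my ($rows) = @_;
    my %data;
    for my $line (@$rows) {
        my ($c1, $c2, undef, @cols) = split /\|/, $line, -1;
        $data{"$c1|$c2"}[$_] += $cols[$_] for 0 .. $#cols;
    }
    return \%data;
}

# Candidates to compare. A database-backed variant (hypothetical, for example
# a bulk load followed by GROUP BY) would be added here under its own name.
my %candidates = (
    perl_hash => sub { summarize_in_perl(\@lines) },
);

# Run each candidate three times and print its timing.
for my $name (sort keys %candidates) {
    my $t = timeit(3, $candidates{$name});
    printf "%-10s %s\n", $name, timestr($t);
}
```

For a fair comparison, the database candidate should include the load and export steps, since the question counts those as part of the database approach.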
