Perl transformation logic - file processing or DB
I am building the transformation logic for a file; it applies certain transformation rules to the fields in the file. Examples of such transformation rules are:
- Setting default values for certain fields if they are empty (if col 5 is empty, set it to "Empty")
- Summarizing the file based on certain columns (if the file has col1, col2 and col3, aggregate col3 for each value of col1)
- Substituting strings in certain fields (replace all "ax" in col1 with "ay")
- Etc.
From a performance perspective, when applying these transformations to a large file, is it better to use plain file processing (read the file line by line, use hashes for summarizing, regexes for other transformations, etc.) or to load the data into a database table, summarize and apply all the transform logic there, and then export it back to a file?
Example of summarization:
Original file has:
A|B|C|100|200|300
A|B|C|200|100|0
A|X|C|100|100|100
Transformed file has:
A|B|300|300|300
A|X|100|100|100
Comments (2)
Assuming the data you have given, this problem is well within Perl's grasp without a database:
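The answer's code listing is missing from this page. As a stand-in, here is a minimal sketch of the hash-based approach, not the original code. It assumes pipe-delimited input in which the first two fields form the grouping key, the third field is dropped, and every remaining field is numeric and summed per key (this matches the sample in the question, but the real rules may differ). The name `summarize` and the in-memory sample are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: fields 1-2 are the grouping key, field 3 is discarded,
# and all remaining fields are numeric values to be summed per key.
sub summarize {
    my ($fh) = @_;
    my (%sum, @order);    # running totals per key, plus first-seen key order

    # Read one line at a time so the whole file never has to sit in memory.
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;

        my ($k1, $k2, undef, @nums) = split /\|/, $line, -1;
        my $key = "$k1|$k2";

        unless (exists $sum{$key}) {
            push @order, $key;
            $sum{$key} = [];
        }
        $sum{$key}[$_] += $nums[$_] for 0 .. $#nums;
    }

    # One output row per key: the key followed by its column totals.
    return map { join '|', $_, @{ $sum{$_} } } @order;
}

# Sample rows from the question, read through an in-memory filehandle.
my $sample = "A|B|C|100|200|300\nA|B|C|200|100|0\nA|X|C|100|100|100\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
print "$_\n" for summarize($in);
close $in;
```

Only the per-key totals are held in the hash, so memory use grows with the number of distinct keys rather than with the size of the file. For a real file, open it by path and pass that handle in place of `$in`.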
which prints:
You will of course need to code in the remaining transforms, but this should give you a start. Even if it turns out you need to access the raw rows more than once, assuming your data is not gigantic, you could load it into a two-dimensional array and then run your passes over it. Or you could use Tie::File to access a very large file without reading it all in.
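As an illustration of the Tie::File suggestion, the sketch below applies two of the per-field rules from the question (a default for an empty column 5, and "ax" replaced by "ay" in column 1) directly to the lines of a file. The function name `transform_in_place` and the temporary demo file are hypothetical; the column positions are assumptions taken from the question's examples.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical helper: edits the file at $path line by line through a tied
# array, so the file is not read into memory as a whole.
sub transform_in_place {
    my ($path) = @_;
    tie my @rows, 'Tie::File', $path or die "Cannot tie $path: $!";

    for my $i (0 .. $#rows) {
        next if $rows[$i] eq '';                   # skip blank lines
        my @f = split /\|/, $rows[$i], -1;         # -1 keeps trailing empty fields

        $f[0] =~ s/ax/ay/g;                        # rule: "ax" -> "ay" in col 1
        $f[4] = 'Empty' if @f > 4 && $f[4] eq '';  # rule: default for empty col 5

        my $new = join '|', @f;
        $rows[$i] = $new if $new ne $rows[$i];     # write back only changed lines
    }

    untie @rows;
    return;
}

# Demo on a throwaway file with made-up rows.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "max|B|C|1||3\nA|X|C|1|2|3\n";
close $tmp_fh;

transform_in_place($tmp_path);

open my $check, '<', $tmp_path or die "Cannot read $tmp_path: $!";
print while <$check>;
close $check;
```

Each assignment to an element of the tied array is written through to the file, which is why the loop stores a row only when it actually changed. Whether this is faster than a plain read-transform-write pass over the file would need to be measured.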
The best solution would be to code the system both ways and take measurements to decide which is better.