Running "head" on a text file inside a zipped archive, without unpacking the archive

Posted on 2024-09-25 07:10:02

Greetings,

I've taken over from a prior team and am writing ETL jobs that process CSV files. I use a combination of shell scripts and Perl on Ubuntu. The CSV files are huge; they arrive as zipped archives. Unzipped, many are more than 30 GB (yes, that's a G).

The legacy process is a batch job running on cron that unzips each file entirely, reads the first line and copies it into a config file, then re-zips the entire file. Some days this takes many hours of processing time, for no benefit.

Can you suggest a method to only extract the first line (or first few lines) from each file inside a zipped archive, without fully unpacking the archives?

Comments (2)

独守阴晴ぅ圆缺 2024-10-02 07:10:02

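The unzip command-line utility has a -p option that writes a file's contents to standard output. Pipe that into head (for example, unzip -p archive.zip data.csv | head -n 1, where both file names are placeholders) and nothing is extracted to disk; once head has its lines and exits, unzip stops on the closed pipe instead of decompressing the rest.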

Alternatively, from perldoc Archive::Zip:

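use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

# 'archive.zip' and 'xyz.txt' are placeholder names
my $zip = Archive::Zip->new('archive.zip') or die "cannot read archive";
my ($status, $bufferRef);
my $member = $zip->memberNamed( 'xyz.txt' );
# ask for stored (uncompressed) data, so readChunk() inflates as it goes
$member->desiredCompressionMethod( COMPRESSION_STORED );
$status = $member->rewindData();
die "error $status" unless $status == AZ_OK;
while ( ! $member->readIsDone() )
{
   ( $bufferRef, $status ) = $member->readChunk();
   die "error $status" if $status != AZ_OK && $status != AZ_STREAM_END;
   # do something with $bufferRef (a reference to the chunk's data):
   print $$bufferRef;
}
$member->endRead();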

Modify to suit, e.g. by iterating over the member list from $zip->memberNames() and stopping after the first few lines.

十年九夏 2024-10-02 07:10:02

Python's zipfile.ZipFile allows you to access archived files as streams via ZipFile.open(). From there you can process them as necessary.
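
A minimal sketch of that approach, assuming the CSV members are UTF-8 text and the goal is the first line (or first few lines) of every file in the archive. The function name first_lines and its parameters are illustrative and not part of zipfile. ZipFile.open() returns a file-like object that decompresses as it is read, so only the start of each member is inflated and nothing is written to disk.

```python
import itertools
import zipfile


def first_lines(zip_path, n=1, encoding="utf-8"):
    """Return {member name: list of its first n lines} for a zip archive.

    Hypothetical helper for illustration; the encoding is an assumption
    about the data and may need changing for other files.
    """
    result = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # open() gives a binary stream; iterating it yields lines
            # split on b"\n", decompressing only as much as is consumed
            with archive.open(info) as member:
                result[info.filename] = [
                    raw.decode(encoding).rstrip("\r\n")
                    for raw in itertools.islice(member, n)
                ]
    return result
```

Calling first_lines("archive.zip") (a placeholder path) would give a dictionary keyed by member name, which a cron job could then write into its config file in place of the unzip and re-zip cycle.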
