Pythonic way to send file contents to a pipe and count the number of lines in a single pass
Given the > 4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I only want to make a single pass through the file. I use awk to output the entire line ($0) to stdout and, through awk's END clause, write the number of rows (awk's NR variable) to a separate file (outfile).
I've managed to do this using awk, but I'd like to know whether a more Pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# zcat decompresses to stdout; awk passes every line through unchanged and,
# in its END block, writes the final line count (NR) to outfile.
# With shell=True, cmd is passed as a single string to /bin/sh -c.
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"}\'' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)
The pipe is later consumed by a call to Teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
Comments (4)
There's no need for either zcat or Awk. Counting the lines in a gzipped file can be done in pure Python with the standard-library gzip module, as sketched below. If you want to do something else with the lines, such as pass them to a different process, you can do that with an explicit loop instead (also sketched below).
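A minimal sketch of the approach this answer describes, using only the standard-library gzip module; the consumer command in the second part is a placeholder, not anything named in the answer:

import gzip
from subprocess import Popen, PIPE

the_file = "/path/to/file/myfile.gz"

# Count the lines: gzip.open decompresses on the fly, so no zcat subprocess is needed.
with gzip.open(the_file, "rb") as f:
    nlines = sum(1 for _ in f)

To pass the lines to a different process while counting them, iterate explicitly:

import gzip
from subprocess import Popen, PIPE

the_file = "/path/to/file/myfile.gz"

# "some_consumer" is a placeholder for whatever command consumes the lines.
consumer = Popen(["some_consumer"], stdin=PIPE)
nlines = 0
with gzip.open(the_file, "rb") as f:
    for line in f:
        nlines += 1
        consumer.stdin.write(line)
consumer.stdin.close()
consumer.wait()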
Counting lines and unzipping gzip-compressed files can easily be done with Python and its standard library, and you can do everything in a single pass; see the sketch below. I don't know how to invoke Fastload, so substitute the correct parameters in the invocation.
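A minimal single-pass sketch along those lines; the fastload command name and arguments are placeholders, as the answer itself says to substitute the real invocation:

import gzip
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# Placeholder invocation -- substitute the real fastload command and parameters.
fastload = Popen(["fastload"], stdin=PIPE)

# Single pass: decompress, count, and stream each line to fastload's stdin.
nlines = 0
with gzip.open(the_file, "rb") as f:
    for line in f:
        nlines += 1
        fastload.stdin.write(line)

fastload.stdin.close()
fastload.wait()

# Record the line count where the awk END block used to write it.
with open(outfile, "w") as f:
    f.write("%d\n" % nlines)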
This can be done in one simple line of bash; see the sketch below. It will print the line count on stderr. If you want it somewhere else, you can redirect the wc output however you like.
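One possible form of that one-liner, assuming bash process substitution is available: stdout still carries the full decompressed data for the downstream consumer, while tee sends a copy to wc -l, which reports the count on stderr.

zcat myfile.gz | tee >(wc -l >&2)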
Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody could post an exact example here.
From the Teradata documentation on the Fastload configuration: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.