Pythonic way to send file contents to a pipe and count the number of lines in a single pass
Given the > 4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file. Ideally, I only want to make a single pass through the file. I use awk to output the entire line ($0) to stdout and, through awk's END clause, write the number of rows (awk's NR variable) to a separate file (outfile).
I've managed to do this using awk, but I'd like to know whether a more Pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# zcat decompresses to stdout; awk passes every line through unchanged and,
# in its END block, writes the final line count (NR) to outfile.
# With shell=True, cmd is passed as a single string to /bin/sh -c.
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"}\'' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)
The pipe is later consumed by a call to Teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of Python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
Comments (4)
There's no need for either zcat or Awk. Counting the lines in a gzipped file can be done in pure Python with the standard-library gzip module, as sketched below. If you want to do something else with the lines, such as pass them to a different process, you can do that with an explicit loop instead (also sketched below).
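A minimal sketch of the approach this answer describes, using only the standard-library gzip module; the consumer command in the second part is a placeholder, not anything named in the answer:

import gzip
from subprocess import Popen, PIPE

the_file = "/path/to/file/myfile.gz"

# Count the lines: gzip.open decompresses on the fly, so no zcat subprocess is needed.
with gzip.open(the_file, "rb") as f:
    nlines = sum(1 for _ in f)

To pass the lines to a different process while counting them, iterate explicitly:

import gzip
from subprocess import Popen, PIPE

the_file = "/path/to/file/myfile.gz"

# "some_consumer" is a placeholder for whatever command consumes the lines.
consumer = Popen(["some_consumer"], stdin=PIPE)
nlines = 0
with gzip.open(the_file, "rb") as f:
    for line in f:
        nlines += 1
        consumer.stdin.write(line)
consumer.stdin.close()
consumer.wait()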
Counting lines and unzipping gzip-compressed files can easily be done with Python and its standard library, and you can do everything in a single pass; see the sketch below. I don't know how to invoke Fastload, so substitute the correct parameters in the invocation.
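A minimal single-pass sketch along those lines; the fastload command name and arguments are placeholders, as the answer itself says to substitute the real invocation:

import gzip
from subprocess import Popen, PIPE
from os import path

the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)

# Placeholder invocation -- substitute the real fastload command and parameters.
fastload = Popen(["fastload"], stdin=PIPE)

# Single pass: decompress, count, and stream each line to fastload's stdin.
nlines = 0
with gzip.open(the_file, "rb") as f:
    for line in f:
        nlines += 1
        fastload.stdin.write(line)

fastload.stdin.close()
fastload.wait()

# Record the line count where the awk END block used to write it.
with open(outfile, "w") as f:
    f.write("%d\n" % nlines)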
This can be done in one simple line of bash; see the sketch below. It will print the line count on stderr. If you want it somewhere else, you can redirect the wc output however you like.
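One possible form of that one-liner, assuming bash process substitution is available: stdout still carries the full decompressed data for the downstream consumer, while tee sends a copy to wc -l, which reports the count on stderr.

zcat myfile.gz | tee >(wc -l >&2)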
Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody could post an exact example here.
From the Teradata documentation on the Fastload configuration: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.