How do I connect multiple files to Popen's standard input?
I'm porting a bash script to python 2.6, and want to replace some code:
cat $( ls -tr xyz_`date +%F`_*.log ) | filter args > bzip2
I guess I want something similar to the "Replacing shell pipeline" example at http://docs.python.org/release/2.6/library/subprocess.html, à la...
p1 = Popen(["filter", "args"], stdin=*?WHAT?*, stdout=PIPE)
p2 = Popen(["bzip2"], stdin=p1.stdout, stdout=PIPE)
output = p2.communicate()[0]
But, I'm not sure how best to provide p1's stdin value so it concatenates the input files. Seems I could add...
p0 = Popen(["cat", "file1", "file2"...], stdout=PIPE)
p1 = ... stdin=p0.stdout ...
...but that seems to be crossing beyond use of (slow, inefficient) pipes to call external programs with significant functionality. (Any decent shell performs the cat internally.)
So, I can imagine a custom class that satisfies the file object API requirements and can therefore be used for p1's stdin, concatenating arbitrary other file objects. (EDIT: existing answers explain why this isn't possible)
Does python 2.6 have a mechanism addressing this need/want, or might another Popen to cat be considered perfectly fine in python circles?
Thanks.
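For reference, the "another Popen to cat" chain the question sketches can be completed into a runnable form. This is a hedged sketch, not an endorsed answer: `shell_pipeline` is an illustrative name, and the filter and compressor are parameters because `filter args` and `bzip2` are just stand-ins from the original command.

```python
# Sketch of the p0 -> p1 -> p2 chain from the question: three processes
# wired stdout-to-stdin, mirroring `cat files | filter args | bzip2`.
from subprocess import Popen, PIPE

def shell_pipeline(filenames, filter_cmd, compress_cmd):
    """Run: cat filenames | filter_cmd | compress_cmd; return output bytes."""
    p0 = Popen(['cat'] + list(filenames), stdout=PIPE)
    p1 = Popen(filter_cmd, stdin=p0.stdout, stdout=PIPE)
    p2 = Popen(compress_cmd, stdin=p1.stdout, stdout=PIPE)
    # Close the parent's copies of the intermediate pipe ends so that
    # SIGPIPE propagates if a downstream process exits early.
    p0.stdout.close()
    p1.stdout.close()
    return p2.communicate()[0]
```

For the original command this would be called along the lines of `shell_pipeline(sorted(glob.glob('xyz_*.log')), ['filter', 'args'], ['bzip2'])`; note that `ls -tr` sorts by modification time, which a plain `sorted()` by name only approximates.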
4 Answers
You can replace everything that you're doing with Python code, except for your external utility. That way your program will remain portable as long as your external util is portable. You can also consider turning the C++ program into a library and using Cython to interface with it. As Messa showed, date is replaced with time.strftime, globbing is done with glob.glob, and cat can be replaced with reading all the files in the list and writing them to the input of your program. The call to bzip2 can be replaced with the bz2 module, but that will complicate your program because you'd have to read and write simultaneously. To do that, you need to either use p.communicate or a thread if the data is huge (select.select would be a better choice, but it won't work on Windows).

Addition: how to detect the file input type

You can use either the file extension or the Python bindings for libmagic to detect how a file is compressed. Here's a code example that does both, and automatically chooses magic if it is available. You can take the part that suits your needs and adapt it. open_autodecompress should detect the mime encoding and open the file with the appropriate decompressor if one is available.
If you look inside the subprocess module implementation, you will see that std{in,out,err} are expected to be file objects supporting the fileno() method, so a simple concatenating file-like object with a Python interface (or even a StringIO object) is not suitable here.

If they were iterators rather than file objects, you could use itertools.chain.

Of course, sacrificing memory consumption, you can do something like this:
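The snippet that followed is missing from the page; this is a hedged sketch of the trade-off the sentence describes: read every input file into memory up front, then let communicate() feed the concatenated bytes through the pipe. The function name is illustrative.

```python
# Hedged sketch of the "sacrifice memory" variant: concatenate all input
# files into one bytes object and hand it to communicate(), which writes
# it to the child's stdin and collects stdout with no risk of pipe
# deadlock -- at the cost of holding the whole input in memory at once.
from subprocess import Popen, PIPE

def filter_in_memory(cmd, filenames):
    chunks = []
    for name in filenames:
        with open(name, 'rb') as f:
            chunks.append(f.read())
    p = Popen(cmd, stdin=PIPE, stdout=PIPE)
    return p.communicate(b''.join(chunks))[0]
```

Called, for instance, as `filter_in_memory(['wc', '-c'], logs)`.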
When using subprocess you have to consider the fact that internally Popen will use the file descriptor (handle) and call os.dup2() for stdin, stdout and stderr before passing them to the child process it creates.

So if you don't want to use a system shell pipe with Popen, I think your other option is to write a cat function in Python that generates a file in a cat-like way, and pass this file to p1's stdin. Don't bother with a class that implements the io API, because, as I said, it won't work: internally the child process will just get the file descriptors.

With that said, I think your better option is to use the Unix pipe way, as in the subprocess documentation.
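One hedged way to read "write a cat function in Python ... and pass this file to p1's stdin": concatenate the inputs into a real temporary file, which, unlike a custom file-like class, has a genuine OS-level fileno() that Popen can dup2() for the child. The helper names below are illustrative.

```python
# Hedged sketch: a Python "cat" that concatenates the inputs into an
# anonymous temporary file, then hands that real file object to Popen
# as the child's stdin.
import shutil
import tempfile
from subprocess import Popen, PIPE

def cat_to_tempfile(filenames):
    """Concatenate filenames into a temp file, rewound for reading."""
    tmp = tempfile.TemporaryFile()
    for name in filenames:
        with open(name, 'rb') as src:
            shutil.copyfileobj(src, tmp)
    tmp.seek(0)
    return tmp

def run_filter(cmd, filenames):
    with cat_to_tempfile(filenames) as stdin_file:
        p = Popen(cmd, stdin=stdin_file, stdout=PIPE)
        return p.communicate()[0]
```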
This should be easy. First, create a pipe using os.pipe, then Popen the filter with the read end of the pipe as its standard input. Then, for each file in the directory whose name matches the pattern, just pass its contents to the write end of the pipe. This should be exactly the same as what the shell command cat ..._*.log | filter args does.

Update: Sorry, the pipe from os.pipe is not needed; I forgot that subprocess.Popen(..., stdin=subprocess.PIPE) actually creates one for you. Also, a pipe cannot be stuffed with too much data; more data can be written to a pipe only after the previous data are read.

So the solution (for example with wc -l), together with a usage example, would be:
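The solution and usage-example code were lost in the copy; this is a hedged reconstruction of the shape the answer describes: stdin=PIPE makes Popen create the pipe, and a helper thread plays cat so the parent can drain stdout while writing, which keeps a full pipe from deadlocking either side. The names are illustrative, not the answerer's original code.

```python
# Hedged reconstruction: a writer thread streams each input file into
# the child's stdin (then closes it so the child sees EOF) while the
# main thread reads the child's stdout.
import threading
from subprocess import Popen, PIPE

def _feed(filenames, pipe):
    for name in filenames:
        with open(name, 'rb') as f:
            for chunk in iter(lambda: f.read(8192), b''):
                pipe.write(chunk)
    pipe.close()  # EOF for the child

def filter_files(cmd, filenames):
    p = Popen(cmd, stdin=PIPE, stdout=PIPE)
    t = threading.Thread(target=_feed, args=(filenames, p.stdin))
    t.start()
    output = p.stdout.read()
    t.join()
    p.wait()
    return output
```

For the wc -l example, usage would look like `filter_files(['wc', '-l'], sorted(glob.glob('xyz_*.log')))`.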