使用 subprocess.Popen 时将大量数据通过管道传输到 stdin

发布于 2024-11-05 15:05:43 字数 1389 浏览 0 评论 0原文

我有点难以理解解决这个简单问题的 python 方法是什么。

我的问题很简单。如果您使用以下代码，它将挂起。这在子流程模块文档中有详细记录。

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
for i in range(100000):
    proc.stdin.write('%d\n' % i)
output = proc.communicate()[0]
print output

寻找一个解决方案（有一个非常有洞察力的线程，但我现在失去了它）我发现这个解决方案（除其他外）使用显式分叉：

import os
import sys
from subprocess import Popen, PIPE

def produce(to_sed):
    for i in range(100000):
        to_sed.write("%d\n" % i)
        to_sed.flush()
    #this would happen implicitly, anyway, but is here for the example
    to_sed.close()

def consume(from_sed):
    while 1:
        res = from_sed.readline()
        if not res:
            sys.exit(0)
            #sys.exit(proc.poll())
        print 'received: ', [res]

def main():
    proc = Popen(['cat','-'],stdin=PIPE,stdout=PIPE)
    to_sed = proc.stdin
    from_sed = proc.stdout

    pid = os.fork()
    if pid == 0 :
        from_sed.close()
        produce(to_sed)
        return
    else :
        to_sed.close()
        consume(from_sed)

if __name__ == '__main__':
    main()

虽然这个解决方案在概念上很容易理解，但它使用了一个多进程并且与子流程模块相比，其级别太低（只是为了隐藏此类内容......）。

我想知道：是否有一个简单而干净的解决方案，使用不会挂起的子流程模块，或者要实现这个模式，我必须退一步并实现旧式选择循环或显式分叉？

谢谢

原文

I'm kind of struggling to understand what is the python way of solving this simple problem.

My problem is quite simple. If you use the follwing code it will hang. This is well documented in the subprocess module doc.

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
for i in range(100000):
    proc.stdin.write('%d\n' % i)
output = proc.communicate()[0]
print output

Searching for a solution (there is a very insightful thread, but I've lost it now) I found this solution (among others) that uses an explicit fork:

import os
import sys
from subprocess import Popen, PIPE

def produce(to_sed):
    for i in range(100000):
        to_sed.write("%d\n" % i)
        to_sed.flush()
    #this would happen implicitly, anyway, but is here for the example
    to_sed.close()

def consume(from_sed):
    while 1:
        res = from_sed.readline()
        if not res:
            sys.exit(0)
            #sys.exit(proc.poll())
        print 'received: ', [res]

def main():
    proc = Popen(['cat','-'],stdin=PIPE,stdout=PIPE)
    to_sed = proc.stdin
    from_sed = proc.stdout

    pid = os.fork()
    if pid == 0 :
        from_sed.close()
        produce(to_sed)
        return
    else :
        to_sed.close()
        consume(from_sed)

if __name__ == '__main__':
    main()

While this solution is conceptually very easy to understand, it uses one more process and stuck as too low level compared to the subprocess module (that is there just to hide this kind of things...).

I'm wondering: is there a simple and clean solution using the subprocess module that won't hung or to implement this patter I have to do a step back and implement an old-style select loop or an explicit fork?

Thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

酒绊 2024-11-12 15:05:43

如果您想要一个纯Python解决方案，则需要将读取器或写入器放在单独的线程中。 threading 包是一种轻量级的方法来实现这一点，可以方便地访问公共对象并且没有混乱的分叉。

import subprocess
import threading
import sys

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
def writer():
    for i in range(100000):
        proc.stdin.write(b'%d\n' % i)
    proc.stdin.close()
thread = threading.Thread(target=writer)
thread.start()
for line in proc.stdout:
    sys.stdout.write(line.decode())
thread.join()
proc.wait()

看到 subprocess 模块经过现代化以支持流和协程可能会很不错，这将允许更优雅地构建混合 Python 片段和 shell 片段的管道。

If you want a pure Python solution, you need to put either the reader or the writer in a separate thread. The threading package is a lightweight way to do this, with convenient access to common objects and no messy forking.

import subprocess
import threading
import sys

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )
def writer():
    for i in range(100000):
        proc.stdin.write(b'%d\n' % i)
    proc.stdin.close()
thread = threading.Thread(target=writer)
thread.start()
for line in proc.stdout:
    sys.stdout.write(line.decode())
thread.join()
proc.wait()

It might be neat to see the subprocess module modernized to support streams and coroutines, which would allow pipelines that mix Python pieces and shell pieces to be constructed more elegantly.

回复收藏 0 原文

醉城メ夜风 2024-11-12 15:05:43

如果不想将所有数据保留在内存中，则必须使用 select。例如：

import subprocess
from select import select
import os

proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

i = 0;
while True:
    rlist, wlist, xlist = [proc.stdout], [], []
    if i < 100000:
        wlist.append(proc.stdin)
    rlist, wlist, xlist = select(rlist, wlist, xlist)
    if proc.stdout in rlist:
        out = os.read(proc.stdout.fileno(), 10)
        print out,
        if not out:
            break
    if proc.stdin in wlist:
        proc.stdin.write('%d\n' % i)
        i += 1
        if i >= 100000:
            proc.stdin.close()

If you don't want to keep all the data in memory, you have to use select. E.g. something like:

import subprocess
from select import select
import os

proc = subprocess.Popen(['cat'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

i = 0;
while True:
    rlist, wlist, xlist = [proc.stdout], [], []
    if i < 100000:
        wlist.append(proc.stdin)
    rlist, wlist, xlist = select(rlist, wlist, xlist)
    if proc.stdout in rlist:
        out = os.read(proc.stdout.fileno(), 10)
        print out,
        if not out:
            break
    if proc.stdin in wlist:
        proc.stdin.write('%d\n' % i)
        i += 1
        if i >= 100000:
            proc.stdin.close()

回复收藏 0 原文

看海 2024-11-12 15:05:43

一旦 cat 的标准输出操作系统管道缓冲区已满，您的代码就会死锁。如果您使用stdout=PIPE;你必须及时消耗它，否则可能会发生你的情况的僵局。

如果进程运行时不需要输出；您可以将其重定向到临时文件：

#!/usr/bin/env python3
import subprocess
import tempfile

with tempfile.TemporaryFile('r+') as output_file:
    with subprocess.Popen(['cat'],
                          stdin=subprocess.PIPE,
                          stdout=output_file,
                          universal_newlines=True) as process:
        for i in range(100000):
            print(i, file=process.stdin)
    output_file.seek(0)  # rewind (and sync with the disk)
    print(output_file.readline(), end='')  # get  the first line of the output

如果输入/输出很小（适合内存）；您可以使用 .communicate() 一次传递所有输入并同时获取所有输出，它可以为您同时读/写：

#!/usr/bin/env python3
import subprocess

cp = subprocess.run(['cat'], input='\n'.join(['%d' % i for i in range(100000)]),
                    stdout=subprocess.PIPE, universal_newlines=True)
print(cp.stdout.splitlines()[-1]) # print the last line

要手动同时读/写，您可以使用线程、asyncio、fcntl等等。@Jed 提供了一个简单的基于线程的解决方案。这是基于 asyncio 的解决方案：

#!/usr/bin/env python3
import asyncio
import sys
from subprocess import PIPE

async def pump_input(writer):
     try:
         for i in range(100000):
             writer.write(b'%d\n' % i)
             await writer.drain()
     finally:
         writer.close()

async def run():
    # start child process
    # NOTE: universal_newlines parameter is not supported
    process = await asyncio.create_subprocess_exec('cat', stdin=PIPE, stdout=PIPE)
    asyncio.ensure_future(pump_input(process.stdin)) # write input
    async for line in process.stdout: # consume output
        print(int(line)**2) # print squares
    return await process.wait()  # wait for the child process to exit


if sys.platform.startswith('win'):
    loop = asyncio.ProactorEventLoop() # for subprocess' pipes on Windows
    asyncio.set_event_loop(loop)
else:
    loop = asyncio.get_event_loop()
loop.run_until_complete(run())
loop.close()

在 Unix 上，您可以使用基于 fcntl 的解决方案：

#!/usr/bin/env python3
import sys
from fcntl import fcntl, F_GETFL, F_SETFL
from os import O_NONBLOCK
from shutil import copyfileobj
from subprocess import Popen, PIPE, _PIPE_BUF as PIPE_BUF

def make_blocking(pipe, blocking=True):
    fd = pipe.fileno()
    if not blocking:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) # set O_NONBLOCK
    else:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK) # clear it


with Popen(['cat'], stdin=PIPE, stdout=PIPE) as process:
    make_blocking(process.stdout, blocking=False)
    with process.stdin:
        for i in range(100000):
            #NOTE: the mode is block-buffered (default) and therefore
            # `cat` won't see it immidiately
            process.stdin.write(b'%d\n' % i)
            # a deadblock may happen here with a *blocking* pipe
            output = process.stdout.read(PIPE_BUF)
            if output is not None:
                sys.stdout.buffer.write(output)
    # read the rest
    make_blocking(process.stdout)
    copyfileobj(process.stdout, sys.stdout.buffer)

Your code deadlocks as soon as cat's stdout OS pipe buffer is full. If you use stdout=PIPE; you have to consume it in time otherwise the deadlock as in your case may happen.

If you don't need the output while the process is running; you could redirect it to a temporary file:

#!/usr/bin/env python3
import subprocess
import tempfile

with tempfile.TemporaryFile('r+') as output_file:
    with subprocess.Popen(['cat'],
                          stdin=subprocess.PIPE,
                          stdout=output_file,
                          universal_newlines=True) as process:
        for i in range(100000):
            print(i, file=process.stdin)
    output_file.seek(0)  # rewind (and sync with the disk)
    print(output_file.readline(), end='')  # get  the first line of the output

If the input/output are small (fit in memory); you could pass the input all at once and get the output all at once using .communicate() that reads/writes concurrently for you:

#!/usr/bin/env python3
import subprocess

cp = subprocess.run(['cat'], input='\n'.join(['%d' % i for i in range(100000)]),
                    stdout=subprocess.PIPE, universal_newlines=True)
print(cp.stdout.splitlines()[-1]) # print the last line

To read/write concurrently manually, you could use threads, asyncio, fcntl, etc. @Jed provided a simple thread-based solution. Here's asyncio-based solution:

#!/usr/bin/env python3
import asyncio
import sys
from subprocess import PIPE

async def pump_input(writer):
     try:
         for i in range(100000):
             writer.write(b'%d\n' % i)
             await writer.drain()
     finally:
         writer.close()

async def run():
    # start child process
    # NOTE: universal_newlines parameter is not supported
    process = await asyncio.create_subprocess_exec('cat', stdin=PIPE, stdout=PIPE)
    asyncio.ensure_future(pump_input(process.stdin)) # write input
    async for line in process.stdout: # consume output
        print(int(line)**2) # print squares
    return await process.wait()  # wait for the child process to exit


if sys.platform.startswith('win'):
    loop = asyncio.ProactorEventLoop() # for subprocess' pipes on Windows
    asyncio.set_event_loop(loop)
else:
    loop = asyncio.get_event_loop()
loop.run_until_complete(run())
loop.close()

On Unix, you could use fcntl-based solution:

#!/usr/bin/env python3
import sys
from fcntl import fcntl, F_GETFL, F_SETFL
from os import O_NONBLOCK
from shutil import copyfileobj
from subprocess import Popen, PIPE, _PIPE_BUF as PIPE_BUF

def make_blocking(pipe, blocking=True):
    fd = pipe.fileno()
    if not blocking:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK) # set O_NONBLOCK
    else:
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_NONBLOCK) # clear it


with Popen(['cat'], stdin=PIPE, stdout=PIPE) as process:
    make_blocking(process.stdout, blocking=False)
    with process.stdin:
        for i in range(100000):
            #NOTE: the mode is block-buffered (default) and therefore
            # `cat` won't see it immidiately
            process.stdin.write(b'%d\n' % i)
            # a deadblock may happen here with a *blocking* pipe
            output = process.stdout.read(PIPE_BUF)
            if output is not None:
                sys.stdout.buffer.write(output)
    # read the rest
    make_blocking(process.stdout)
    copyfileobj(process.stdout, sys.stdout.buffer)

回复收藏 0 原文

软甜啾 2024-11-12 15:05:43

这是我用来通过子进程加载 6G mysql 转储文件的东西。远离 shell=True。不安全并且从进程中启动浪费资源。

import subprocess

fhandle = None

cmd = [mysql_path,
      "-u", mysql_user, "-p" + mysql_pass],
      "-h", host, database]

fhandle = open(dump_file, 'r')
p = subprocess.Popen(cmd, stdin=fhandle, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

(stdout,stderr) = p.communicate()

fhandle.close()

Here's something I used to load 6G mysql dump file loads via subprocess. Stay away from shell=True. Not secure and start out of process wasting resources.

import subprocess

fhandle = None

cmd = [mysql_path,
      "-u", mysql_user, "-p" + mysql_pass],
      "-h", host, database]

fhandle = open(dump_file, 'r')
p = subprocess.Popen(cmd, stdin=fhandle, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

(stdout,stderr) = p.communicate()

fhandle.close()

回复收藏 0 原文

荆棘i 2024-11-12 15:05:43

对于这种事情，shell 比 subprocess 工作得更好。

编写非常简单的Python应用程序，从sys.stdin读取并写入sys.stdout。

使用 shell 管道将简单的应用程序连接在一起。

如果需要，可以使用 subprocess 启动管道或仅编写一行 shell 脚本。

python part1.py | python part2.py

这是非常非常有效的。只要保持非常简单，它也可以移植到所有 Linux（和 Windows）。

For this kind of thing, the shell works a lot better than subprocess.

Write very simple Python apps which read from sys.stdin and write to sys.stdout.

Connect the simple apps together using a shell pipeline.

If you want, start the pipeline using subprocess or simply write a one-line shell script.

python part1.py | python part2.py

This is very, very efficient. It's also portable to all Linux (and Windows) as long as you keep it very simple.

回复收藏 0 原文

圈圈圆圆圈圈 2024-11-12 15:05:43

使用 aiofiles 和python 3.5 中的 asyncio：

有点复杂，但只需要 1024 字节内存即可在 stdin 中写入！

import asyncio
import aiofiles
import sys
from os.path import dirname, join, abspath
import subprocess as sb


THIS_DIR = abspath(dirname(__file__))
SAMPLE_FILE = join(THIS_DIR, '../src/hazelnut/tests/stuff/sample.mp4')
DEST_PATH = '/home/vahid/Desktop/sample.mp4'


async def async_file_reader(f, buffer):
    async for l in f:
        if l:
            buffer.append(l)
        else:
            break
    print('reader done')


async def async_file_writer(source_file, target_file):
    length = 0
    while True:
        input_chunk = await source_file.read(1024)
        if input_chunk:
            length += len(input_chunk)
            target_file.write(input_chunk)
            await target_file.drain()
        else:
            target_file.write_eof()
            break

    print('writer done: %s' % length)


async def main():
    dir_name = dirname(DEST_PATH)
    remote_cmd = 'ssh localhost mkdir -p %s && cat - > %s' % (dir_name, DEST_PATH)

    stdout, stderr = [], []
    async with aiofiles.open(SAMPLE_FILE, mode='rb') as f:
        cmd = await asyncio.create_subprocess_shell(
            remote_cmd,
            stdin=sb.PIPE,
            stdout=sb.PIPE,
            stderr=sb.PIPE,
        )

        await asyncio.gather(*(
            async_file_reader(cmd.stdout, stdout),
            async_file_reader(cmd.stderr, stderr),
            async_file_writer(f, cmd.stdin)
        ))

        print('EXIT STATUS: %s' % await cmd.wait())

    stdout, stderr = '\n'.join(stdout), '\n'.join(stderr)

    if stdout:
        print(stdout)

    if stderr:
        print(stderr, file=sys.stderr)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

结果：

writer done: 383631
reader done
reader done
EXIT STATUS: 0

Using the aiofiles & asyncio in python 3.5:

A bit complicated, but you need only 1024 Bytes memory to writing in stdin!

import asyncio
import aiofiles
import sys
from os.path import dirname, join, abspath
import subprocess as sb


THIS_DIR = abspath(dirname(__file__))
SAMPLE_FILE = join(THIS_DIR, '../src/hazelnut/tests/stuff/sample.mp4')
DEST_PATH = '/home/vahid/Desktop/sample.mp4'


async def async_file_reader(f, buffer):
    async for l in f:
        if l:
            buffer.append(l)
        else:
            break
    print('reader done')


async def async_file_writer(source_file, target_file):
    length = 0
    while True:
        input_chunk = await source_file.read(1024)
        if input_chunk:
            length += len(input_chunk)
            target_file.write(input_chunk)
            await target_file.drain()
        else:
            target_file.write_eof()
            break

    print('writer done: %s' % length)


async def main():
    dir_name = dirname(DEST_PATH)
    remote_cmd = 'ssh localhost mkdir -p %s && cat - > %s' % (dir_name, DEST_PATH)

    stdout, stderr = [], []
    async with aiofiles.open(SAMPLE_FILE, mode='rb') as f:
        cmd = await asyncio.create_subprocess_shell(
            remote_cmd,
            stdin=sb.PIPE,
            stdout=sb.PIPE,
            stderr=sb.PIPE,
        )

        await asyncio.gather(*(
            async_file_reader(cmd.stdout, stdout),
            async_file_reader(cmd.stderr, stderr),
            async_file_writer(f, cmd.stdin)
        ))

        print('EXIT STATUS: %s' % await cmd.wait())

    stdout, stderr = '\n'.join(stdout), '\n'.join(stderr)

    if stdout:
        print(stdout)

    if stderr:
        print(stderr, file=sys.stderr)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

Result:

writer done: 383631
reader done
reader done
EXIT STATUS: 0

回复收藏 0 原文

过潦 2024-11-12 15:05:43

下面是一个使用管道从 gzip 一次读取一条记录的示例（Python 3）：

cmd = 'gzip -dc compressed_file.gz'
pipe = Popen(cmd, stdout=PIPE).stdout

for line in pipe:
    print(":", line.decode(), end="")

我知道有一个标准模块，它只是作为一个示例。您可以使用通信方法一次性读取整个输出（如 shell 反引号），但显然您必须注意内存大小。

下面是一个在 Linux 上将记录写入 lp(1) 程序的示例（又是 Python 3）：

cmd = 'lp -'
proc = Popen(cmd, stdin=PIPE)
proc.communicate(some_data.encode())

Here is an example (Python 3) of reading one record at a time from gzip using a pipe:

cmd = 'gzip -dc compressed_file.gz'
pipe = Popen(cmd, stdout=PIPE).stdout

for line in pipe:
    print(":", line.decode(), end="")

I know there is a standard module for that, it is just meant as an example. You can read the whole output in one go (like shell back-ticks) using the communicate method, but obviously you hav eto be careful of memory size.

Here is an example (Python 3 again) of writing records to the lp(1) program on Linux:

cmd = 'lp -'
proc = Popen(cmd, stdin=PIPE)
proc.communicate(some_data.encode())

回复收藏 0 原文

晨曦÷微暖 2024-11-12 15:05:43

现在我知道这不会完全满足您的纯粹主义者，因为输入必须适合内存，并且您无法选择与输入输出交互工作，但至少这在您的示例中效果很好。通信方法可以选择将输入作为参数，如果您以这种方式向进程提供输入，它将起作用。

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )

input = "".join('{0:d}\n'.format(i) for i in range(100000))
output = proc.communicate(input)[0]
print output

至于更大的问题，你可以继承 Popen，重写 __init__ 来接受类似流的对象作为 stdin、stdout、stderr 的参数，并重写 _communicate 方法（毛茸茸的 for跨平台，您需要执行两次（请参阅 subprocess.py 源代码）以在 stdin 流上调用 read() 并将输出 write() 到 stdout 和 stderr 流。这种方法让我困扰的是，据我所知，它还没有被完成。当以前没有做过明显的事情时，通常有一个原因（它没有按预期工作），但我不明白为什么它不应该，除了你需要流在 Windows 中是线程安全的事实。

Now I know this is not going to satisfy the purist in you completely, as the input will have to fit in memory, and you have no option to work interactively with input-output, but at least this works fine on your example. The communicate method optionally takes the input as an argument, and if you feed your process its input this way, it will work.

import subprocess

proc = subprocess.Popen(['cat','-'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        )

input = "".join('{0:d}\n'.format(i) for i in range(100000))
output = proc.communicate(input)[0]
print output

As for the larger problem, you can subclass Popen, rewrite __init__ to accept stream-like objects as arguments to stdin, stdout, stderr, and rewrite the _communicate method (hairy for crossplatform, you need to do it twice, see the subprocess.py source) to call read() on the stdin stream and write() the output to the stdout and stderr streams. What bothers me about this approach is that as far as I know, it hasn't already been done. When obvious things have not been done before, there's usually a reason (it doesn't work as intended), but I can't see why it shoudn't, apart from the fact you need the streams to be thread-safe in Windows.

回复收藏 0 原文

幽蝶幻影 2024-11-12 15:05:43

我能想到的最简单的解决方案：

from subprocess import Popen, PIPE
from threading import Thread

s = map(str,xrange(10000)) # a large string
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=1)
Thread(target=lambda: any((p.stdin.write(b) for b in s)) or p.stdin.close()).start()
print (p.stdout.read())

缓冲版本：

from subprocess import Popen, PIPE
from threading import Thread

s = map(str,xrange(10000)) # a large string
n = 1024 # buffer size
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=n)
Thread(target=lambda: any((p.stdin.write(c) for c in (s[i:i+n] for i in xrange(0, len(s), n)))) or p.stdin.close()).start()
print (p.stdout.read())

The simplest solution I can think of:

from subprocess import Popen, PIPE
from threading import Thread

s = map(str,xrange(10000)) # a large string
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=1)
Thread(target=lambda: any((p.stdin.write(b) for b in s)) or p.stdin.close()).start()
print (p.stdout.read())

Buffered version:

from subprocess import Popen, PIPE
from threading import Thread

s = map(str,xrange(10000)) # a large string
n = 1024 # buffer size
p = Popen(['cat'], stdin=PIPE, stdout=PIPE, bufsize=n)
Thread(target=lambda: any((p.stdin.write(c) for c in (s[i:i+n] for i in xrange(0, len(s), n)))) or p.stdin.close()).start()
print (p.stdout.read())

回复收藏 0 原文

回眸一遍 2024-11-12 15:05:43

我正在寻找一个示例代码来增量迭代进程输出，因为该进程消耗来自提供的迭代器的输入（也增量）。基本上：

import string
import random

# That's what I consider a very useful function, though didn't
# find any existing implementations.
def process_line_reader(args, stdin_lines):
    # args - command to run, same as subprocess.Popen
    # stdin_lines - iterable with lines to send to process stdin
    # returns - iterable with lines received from process stdout
    pass

# Returns iterable over n random strings. n is assumed to be infinity if negative.
# Just an example of function that returns potentially unlimited number of lines.
def random_lines(n, M=8):
    while 0 != n:
        yield "".join(random.choice(string.letters) for _ in range(M))
        if 0 < n:
            n -= 1

# That's what I consider to be a very convenient use case for
# function proposed above.
def print_many_uniq_numbered_random_lines():
    i = 0
    for line in process_line_reader(["uniq", "-i"], random_lines(100500 * 9000)):
        # Key idea here is that `process_line_reader` will feed random lines into
        # `uniq` process stdin as lines are consumed from returned iterable.
        print "#%i: %s" % (i, line)
        i += 1

这里建议的一些解决方案允许使用线程（但并不总是方便）或使用 asyncio（在 Python 2.x 中不可用）来完成此操作。下面是允许执行此操作的工作实现示例。

import subprocess
import os
import fcntl
import select

class nonblocking_io(object):
    def __init__(self, f):
        self._fd = -1
        if type(f) is int:
            self._fd = os.dup(f)
            os.close(f)
        elif type(f) is file:
            self._fd = os.dup(f.fileno())
            f.close()
        else:
            raise TypeError("Only accept file objects or interger file descriptors")
        flag = fcntl.fcntl(self._fd, fcntl.F_GETFL)
        fcntl.fcntl(self._fd, fcntl.F_SETFL, flag | os.O_NONBLOCK)
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def fileno(self):
        return self._fd
    def close(self):
        if 0 <= self._fd:
            os.close(self._fd)
            self._fd = -1

class nonblocking_line_writer(nonblocking_io):
    def __init__(self, f, lines, autoclose=True, buffer_size=16*1024, encoding="utf-8", linesep=os.linesep):
        super(nonblocking_line_writer, self).__init__(f)
        self._lines = iter(lines)
        self._lines_ended = False
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._buffer_offset = 0
        self._buffer = bytearray()
        self._encoding = encoding
        self._linesep = bytearray(linesep, encoding)
    # Returns False when `lines` iterable is exhausted and all pending data is written
    def continue_writing(self):
        while True:
            if self._buffer_offset < len(self._buffer):
                n = os.write(self._fd, self._buffer[self._buffer_offset:])
                self._buffer_offset += n
                if self._buffer_offset < len(self._buffer):
                    return True
            if self._lines_ended:
                if self._autoclose:
                    self.close()
                return False
            self._buffer[:] = []
            self._buffer_offset = 0
            while len(self._buffer) < self._buffer_size:
                line = next(self._lines, None)
                if line is None:
                    self._lines_ended = True
                    break
                self._buffer.extend(bytearray(line, self._encoding))
                self._buffer.extend(self._linesep)

class nonblocking_line_reader(nonblocking_io):
    def __init__(self, f, autoclose=True, buffer_size=16*1024, encoding="utf-8"):
        super(nonblocking_line_reader, self).__init__(f)
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._encoding = encoding
        self._file_ended = False
        self._line_part = ""
    # Returns (lines, more) tuple, where lines is iterable with lines read and more will
    # be set to False after EOF.
    def continue_reading(self):
        lines = []
        while not self._file_ended:
            data = os.read(self._fd, self._buffer_size)
            if 0 == len(data):
                self._file_ended = True
                if self._autoclose:
                    self.close()
                if 0 < len(self._line_part):
                    lines.append(self._line_part.decode(self._encoding))
                    self._line_part = ""
                break
            for line in data.splitlines(True):
                self._line_part += line
                if self._line_part.endswith(("\n", "\r")):
                    lines.append(self._line_part.decode(self._encoding).rstrip("\n\r"))
                    self._line_part = ""
            if len(data) < self._buffer_size:
                break
        return (lines, not self._file_ended)

class process_line_reader(object):
    def __init__(self, args, stdin_lines):
        self._p = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self._reader = nonblocking_line_reader(self._p.stdout)
        self._writer = nonblocking_line_writer(self._p.stdin, stdin_lines)
        self._iterator = self._communicate()
    def __iter__(self):
        return self._iterator
    def __enter__(self):
        return self._iterator
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def _communicate(self):
        read_set = [self._reader]
        write_set = [self._writer]
        while read_set or write_set:
            try:
                rlist, wlist, xlist = select.select(read_set, write_set, [])
            except select.error, e:
                if e.args[0] == errno.EINTR:
                    continue
                raise
            if self._reader in rlist:
                stdout_lines, more = self._reader.continue_reading()
                for line in stdout_lines:
                    yield line
                if not more:
                    read_set.remove(self._reader)
            if self._writer in wlist:
                if not self._writer.continue_writing():
                    write_set.remove(self._writer)
        self.close()
    def lines(self):
        return self._iterator
    def close(self):
        if self._iterator is not None:
            self._reader.close()
            self._writer.close()
            self._p.wait()
            self._iterator = None

I was looking for an example code to iterate over process output incrementally as this process consumes its input from provided iterator (incrementally as well). Basically:

import string
import random

# That's what I consider a very useful function, though didn't
# find any existing implementations.
def process_line_reader(args, stdin_lines):
    # args - command to run, same as subprocess.Popen
    # stdin_lines - iterable with lines to send to process stdin
    # returns - iterable with lines received from process stdout
    pass

# Returns iterable over n random strings. n is assumed to be infinity if negative.
# Just an example of function that returns potentially unlimited number of lines.
def random_lines(n, M=8):
    while 0 != n:
        yield "".join(random.choice(string.letters) for _ in range(M))
        if 0 < n:
            n -= 1

# That's what I consider to be a very convenient use case for
# function proposed above.
def print_many_uniq_numbered_random_lines():
    i = 0
    for line in process_line_reader(["uniq", "-i"], random_lines(100500 * 9000)):
        # Key idea here is that `process_line_reader` will feed random lines into
        # `uniq` process stdin as lines are consumed from returned iterable.
        print "#%i: %s" % (i, line)
        i += 1

Some of solutions suggested here allow to do it with threads (but it's not always convenient) or with asyncio (which is not available in Python 2.x). Below is example of working implementation that allows to do it.

import subprocess
import os
import fcntl
import select

class nonblocking_io(object):
    def __init__(self, f):
        self._fd = -1
        if type(f) is int:
            self._fd = os.dup(f)
            os.close(f)
        elif type(f) is file:
            self._fd = os.dup(f.fileno())
            f.close()
        else:
            raise TypeError("Only accept file objects or interger file descriptors")
        flag = fcntl.fcntl(self._fd, fcntl.F_GETFL)
        fcntl.fcntl(self._fd, fcntl.F_SETFL, flag | os.O_NONBLOCK)
    def __enter__(self):
        return self
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def fileno(self):
        return self._fd
    def close(self):
        if 0 <= self._fd:
            os.close(self._fd)
            self._fd = -1

class nonblocking_line_writer(nonblocking_io):
    def __init__(self, f, lines, autoclose=True, buffer_size=16*1024, encoding="utf-8", linesep=os.linesep):
        super(nonblocking_line_writer, self).__init__(f)
        self._lines = iter(lines)
        self._lines_ended = False
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._buffer_offset = 0
        self._buffer = bytearray()
        self._encoding = encoding
        self._linesep = bytearray(linesep, encoding)
    # Returns False when `lines` iterable is exhausted and all pending data is written
    def continue_writing(self):
        while True:
            if self._buffer_offset < len(self._buffer):
                n = os.write(self._fd, self._buffer[self._buffer_offset:])
                self._buffer_offset += n
                if self._buffer_offset < len(self._buffer):
                    return True
            if self._lines_ended:
                if self._autoclose:
                    self.close()
                return False
            self._buffer[:] = []
            self._buffer_offset = 0
            while len(self._buffer) < self._buffer_size:
                line = next(self._lines, None)
                if line is None:
                    self._lines_ended = True
                    break
                self._buffer.extend(bytearray(line, self._encoding))
                self._buffer.extend(self._linesep)

class nonblocking_line_reader(nonblocking_io):
    def __init__(self, f, autoclose=True, buffer_size=16*1024, encoding="utf-8"):
        super(nonblocking_line_reader, self).__init__(f)
        self._autoclose = autoclose
        self._buffer_size = buffer_size
        self._encoding = encoding
        self._file_ended = False
        self._line_part = ""
    # Returns (lines, more) tuple, where lines is iterable with lines read and more will
    # be set to False after EOF.
    def continue_reading(self):
        lines = []
        while not self._file_ended:
            data = os.read(self._fd, self._buffer_size)
            if 0 == len(data):
                self._file_ended = True
                if self._autoclose:
                    self.close()
                if 0 < len(self._line_part):
                    lines.append(self._line_part.decode(self._encoding))
                    self._line_part = ""
                break
            for line in data.splitlines(True):
                self._line_part += line
                if self._line_part.endswith(("\n", "\r")):
                    lines.append(self._line_part.decode(self._encoding).rstrip("\n\r"))
                    self._line_part = ""
            if len(data) < self._buffer_size:
                break
        return (lines, not self._file_ended)

class process_line_reader(object):
    def __init__(self, args, stdin_lines):
        self._p = subprocess.Popen(args, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        self._reader = nonblocking_line_reader(self._p.stdout)
        self._writer = nonblocking_line_writer(self._p.stdin, stdin_lines)
        self._iterator = self._communicate()
    def __iter__(self):
        return self._iterator
    def __enter__(self):
        return self._iterator
    def __exit__(self, type, value, traceback):
        self.close()
        return False
    def _communicate(self):
        read_set = [self._reader]
        write_set = [self._writer]
        while read_set or write_set:
            try:
                rlist, wlist, xlist = select.select(read_set, write_set, [])
            except select.error, e:
                if e.args[0] == errno.EINTR:
                    continue
                raise
            if self._reader in rlist:
                stdout_lines, more = self._reader.continue_reading()
                for line in stdout_lines:
                    yield line
                if not more:
                    read_set.remove(self._reader)
            if self._writer in wlist:
                if not self._writer.continue_writing():
                    write_set.remove(self._writer)
        self.close()
    def lines(self):
        return self._iterator
    def close(self):
        if self._iterator is not None:
            self._reader.close()
            self._writer.close()
            self._p.wait()
            self._iterator = None

回复收藏 0 原文

~没有更多了~