How to pipe stdin/stdout through a Perl script using Python

This Python code pipes data through a Perl script just fine.

import subprocess
kw = {}
kw['executable'] = None
kw['shell'] = True   # the shell handles the < input redirection
kw['stdin'] = None
kw['stdout'] = subprocess.PIPE
kw['stderr'] = subprocess.PIPE
# One command string for the shell; input comes from a file on disk.
args = ' '.join(['/usr/bin/perl', '-w', '/path/script.perl', '<', '/path/mydata'])
subproc = subprocess.Popen(args, **kw)
for line in iter(subproc.stdout.readline, ''):
    print line.rstrip().decode('UTF-8')
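
For comparison, a minimal shell-free equivalent of the snippet above (a sketch using the same paths; Popen accepts any open file object as stdin):

import subprocess

# Let the OS wire the data file straight to the child's stdin; no shell needed.
infile = open('/path/mydata', 'rb')
subproc = subprocess.Popen(['/usr/bin/perl', '-w', '/path/script.perl'],
                           stdin=infile,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
for line in iter(subproc.stdout.readline, ''):
    print line.rstrip().decode('UTF-8')
infile.close()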

However, it requires that I first save my buffers to a disk file (/path/mydata). It would be cleaner to loop through the data in Python and pass it line by line to the subprocess, like this:

import subprocess
import codecs   # needed for codecs.open below

kw = {}
kw['executable'] = '/usr/bin/perl'
kw['shell'] = False
kw['stderr'] = subprocess.PIPE
kw['stdin'] = subprocess.PIPE
kw['stdout'] = subprocess.PIPE
# Note: with 'executable' set, args[0] ('-w') becomes the child's argv[0],
# so perl may never actually see the -w switch.
args = ['-w', '/path/script.perl']
subproc = subprocess.Popen(args, **kw)
f = codecs.open('/path/mydata', 'r', 'UTF-8')
for line in f:
    subproc.stdin.write('%s\n' % (line.strip().encode('UTF-8')))
    print line.strip()  ### code hangs after printing this ###
    for line in iter(subproc.stdout.readline, ''):
        print line.rstrip().decode('UTF-8')
subproc.terminate()
f.close()

The code hangs at the inner readline loop after sending the first line to the subprocess. I have other executables that work perfectly with this exact same code.

My data files can be quite large (1.5 GB). Is there a way to pipe the data without saving it to a file first? I don't want to rewrite the Perl script for compatibility with other systems.

3 Answers

夜光 2024-12-30 12:09:16

Your code is blocking at the line:

for line in iter(subproc.stdout.readline, ''):

because the only way this iteration can terminate is when EOF (end-of-file) is reached, which happens when the subprocess terminates. You don't want to wait until the process terminates, though; you only want to wait until it has finished processing the line that was sent to it.

Furthermore, you're encountering buffering issues, as Chris Morgan has already pointed out. Another question on Stack Overflow discusses how to do non-blocking reads with subprocess. I've hacked up a quick and dirty adaptation of the code from that question to your problem:

import subprocess
import codecs
import threading
import Queue

def enqueue_output(out, queue):
    # Runs in a background thread: blocks on readline without stalling the writer.
    for line in iter(out.readline, ''):
        queue.put(line)
    out.close()

kw = {}
kw['executable'] = '/usr/bin/perl'
kw['shell'] = False
kw['stderr'] = subprocess.PIPE
kw['stdin'] = subprocess.PIPE
kw['stdout'] = subprocess.PIPE
args = ['-w', '/path/script.perl']
subproc = subprocess.Popen(args, **kw)
f = codecs.open('/path/mydata', 'r', 'UTF-8')
q = Queue.Queue()
t = threading.Thread(target=enqueue_output, args=(subproc.stdout, q))
t.daemon = True
t.start()
for line in f:
    subproc.stdin.write('%s\n' % (line.strip().encode('UTF-8')))
    print "Sent:", line.strip()
    try:
        line = q.get_nowait()  # non-blocking: grab output if any is ready
    except Queue.Empty:
        pass
    else:
        print "Received:", line.rstrip().decode('UTF-8')

subproc.terminate()
f.close()

It's quite likely that you'll need to make modifications to this code, but at least it doesn't block.
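
One gap worth noting: any lines still in flight when the input loop ends are never printed, and terminate() may kill the child mid-stream. A minimal way to finish cleanly (a sketch, reusing subproc, q, and t from above) is to send EOF and then drain the queue:

# After the input loop: closing stdin flushes the pipe and delivers EOF,
# so the child can finish and the reader thread's readline loop ends.
subproc.stdin.close()
subproc.wait()
t.join()
while True:
    try:
        line = q.get_nowait()
    except Queue.Empty:
        break
    print "Received:", line.rstrip().decode('UTF-8')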

你不是我要的菜∠ 2024-12-30 12:09:16

Thanks srgerg. I had also tried the threading solution, but on its own it always hung. Both my previous code and srgerg's code were missing the final piece; your tip gave me one last idea.

The final solution writes enough dummy data to force the last valid lines out of the buffer. To support this, I added code that tracks how many valid lines were written to stdin. The threaded loop opens the output file, saves the data, and breaks when the number of lines read equals the number of valid input lines. This solution ensures line-by-line reading and writing for a file of any size.

import subprocess
import codecs
import threading

def std_output(stdout, outfile=''):
    out = 0
    f = codecs.open(outfile, 'w', 'UTF-8')
    for line in iter(stdout.readline, ''):
        f.write('%s\n' % (line.rstrip().decode('UTF-8')))
        out += 1
        if i == out: break   # i is the global count of valid lines sent
    stdout.close()
    f.close()

outfile = '/path/myout'
infile = '/path/mydata'

# args and kw as defined in the question's second snippet
subproc = subprocess.Popen(args, **kw)
t = threading.Thread(target=std_output, args=[subproc.stdout, outfile])
t.daemon = True
t.start()

i = 0
f = codecs.open(infile, 'r', 'UTF-8')
for line in f:
    subproc.stdin.write('%s\n' % (line.strip().encode('UTF-8')))
    i += 1
subproc.stdin.write('%s\n' % (' ' * 4096))  ### push dummy data to flush the pipe buffer ###
f.close()
t.join()
subproc.terminate()
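
For what it's worth, a variation that would avoid both the dummy write and the line counting (a sketch, untested, reusing the names above) is to close the child's stdin once the input is exhausted; the close flushes the pipe and delivers EOF, so the reader thread's readline loop ends on its own and the i == out check becomes unnecessary:

i = 0
f = codecs.open(infile, 'r', 'UTF-8')
for line in f:
    subproc.stdin.write('%s\n' % (line.strip().encode('UTF-8')))
    i += 1
f.close()
subproc.stdin.close()   # flush and send EOF instead of pushing dummy data
t.join()                # reader thread exits when stdout reaches EOF
subproc.wait()
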
习ぎ惯性依靠 2024-12-30 12:09:16

See the warnings mentioned in the manual about using Popen.stdin and Popen.stdout (just above Popen.stdin):

Warning: Use communicate() rather than .stdin.write, .stdout.read or .stderr.read to avoid deadlocks due to any of the other OS pipe buffers filling up and blocking the child process.

I realise that having a gigabyte-and-a-half string in memory all at once isn't very desirable, but using communicate() is a way that will work, while as you've observed, once the OS pipe buffer fills up, the stdin.write() + stdout.read() way can become deadlocked.

Is using communicate() feasible for you?
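
For reference, a minimal sketch of what the communicate() route would look like here (assuming the whole 1.5 GB input fits in memory, with paths as in the question):

import subprocess
import codecs

# Read everything up front; communicate() then feeds stdin and drains
# stdout/stderr internally, so full OS pipe buffers cannot deadlock it.
f = codecs.open('/path/mydata', 'r', 'UTF-8')
data = f.read()
f.close()

subproc = subprocess.Popen(['/usr/bin/perl', '-w', '/path/script.perl'],
                           stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
out, err = subproc.communicate(data.encode('UTF-8'))
print out.rstrip().decode('UTF-8')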
