Python subprocess; can't read stdout
I have about 500,000+ txt files, about 7+ GB of data in total. I am using Python to put them into a SQLite database. I am creating 2 tables: 1. is the pK and the hyperlink to the file.
For the other table I am using an entity extractor that was developed in Perl by a coworker.
To accomplish this I am using subprocess.Popen(). Prior to this method I was opening the Perl process at every iteration of my loop, but it was simply too expensive to be useful.
I need the Perl process to be dynamic: I need to be able to send data back and forth to it, and the process must not terminate until I tell it to do so. The Perl script was modified so it accepts the full string of a file on stdin, and gives me a line on stdout when it gets a \n. But I am having trouble reading the data...
If I use communicate(), my subprocess is terminated and I get an I/O error at the next iteration of my loop. If I try to use readline() or read(), it locks up. Here are some examples of the different behavior I am experiencing.
This deadlocks my system and I need to force-close Python to continue.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe", "D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
    f = open(infile)
    reportString = f.read()
    f.close()
    reportString = reportString.replace('\n', ' ')
    reportString = reportString.replace('\r', ' ')
    reportString = reportString + '\n'
    numberExtractor.stdin.write(reportString)
    x = numberExtractor.stdout.read()  # I can not see the STDOUT; Python freezes and does not run past here.
    print x
This cancels the subprocess and I get an I/O error at the next iteration of my loop.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe", "D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
    f = open(infile)
    reportString = f.read()
    f.close()
    reportString = reportString.replace('\n', ' ')
    reportString = reportString.replace('\r', ' ')
    reportString = reportString + '\n'
    numberExtractor.stdin.write(reportString)
    x = numberExtractor.communicate()  # Works; I can see my STDOUT from Perl, but the process terminates and will not run on the next iteration.
    print x
If I just run it like this, it runs through all the code fine. The print line is ', mode 'rb' at 0x015dbf08> for each item in my folder.
numberExtractor = subprocess.Popen(["C:\\Perl\\bin\\perl5.10.0.exe", "D:\\MyDataExtractor\\extractSerialNumbers.pl"], stdout=subprocess.PIPE, stdin=subprocess.PIPE)
for infile in glob.glob(self.dirfilename + '\\*\\*.txt'):
    f = open(infile)
    reportString = f.read()
    f.close()
    reportString = reportString.replace('\n', ' ')
    reportString = reportString.replace('\r', ' ')
    reportString = reportString + '\n'
    numberExtractor.stdin.write(reportString)
    x = numberExtractor.stdout  # I can not get the value of the object, but it runs through all my files fine.
    print x
Hopefully I am making a simple mistake, but is there some way I can just send a file to my Perl process (stdin), get the stdout, and then repeat, without having to reopen my subprocess for every file in my loop?
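For what it's worth, a minimal sketch of the line-per-request pattern being described: write one line, flush the pipe, then read back exactly one line with readline() instead of read() (read() blocks until EOF, which never arrives while the child stays alive). This assumes the Perl side flushes its own output after each line (e.g. `$| = 1;`); a small Python child stands in for extractSerialNumbers.pl here so the example is self-contained:

```python
import subprocess
import sys

# Hypothetical stand-in for the Perl extractor: a line-oriented filter
# that reads one line, answers with one line, and flushes. The matching
# fix on the real Perl side is `$| = 1;` so its prints are unbuffered.
child = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write('got %d chars\\n' % len(line.rstrip()))\n"
    "    sys.stdout.flush()\n"
)

proc = subprocess.Popen(
    [sys.executable, "-c", child],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    universal_newlines=True)  # text-mode pipes

replies = []
for report in ["first report", "second report text"]:
    proc.stdin.write(report + "\n")  # one request per line
    proc.stdin.flush()               # push it through the pipe now
    # Read exactly one reply line; .read() would wait for EOF and hang.
    replies.append(proc.stdout.readline().strip())

proc.stdin.close()  # EOF tells the child to exit
proc.wait()
print(replies)      # ['got 12 chars', 'got 18 chars']
```

The subprocess stays open across the whole loop; only the per-file writes and reads happen inside it, which is exactly the shape the question is asking for.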
1 Answer
Consider using the shell. Life is simpler.
Don't mess around with having Python start perl and all that. Just read the results from perl and process those results in Python.
Since both processes run concurrently, this tends to be pretty fast and use a lot of CPU resources without much programming on your part.
In the Python program (load_database.py) you can simply use the fileinput module to read the entire file provided on stdin. That's about all you need in the Python program if you make the shell do the dirty work of setting up the pipeline.
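A sketch of what that load_database.py side might look like; the `extracted` table and the `load` helper are illustrative, not from the original post:

```python
import fileinput
import sqlite3

def load(db, files=None):
    """Insert one row per input line into a hypothetical `extracted`
    table. With files=None, fileinput reads stdin, so the shell can do
    the plumbing and this script just consumes the pipe."""
    db.execute("CREATE TABLE IF NOT EXISTS extracted (value TEXT)")
    for line in fileinput.input(files):
        db.execute("INSERT INTO extracted (value) VALUES (?)",
                   (line.rstrip("\n"),))
    db.commit()
```

The shell command feeding it would look something like `perl extractSerialNumbers.pl *.txt | python load_database.py` (script names illustrative); the Perl and Python processes then run concurrently, as the answer describes.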