How do I run a multiprocessing pool with gTTS in Python?
I'm trying to do bulk text-to-speech conversion without using cloud services like AWS Polly.
gTTS gives good-quality text to speech but requires an internet connection to get results. Running TTS on individual strings, for example with the code below, is very slow. post_id and summary are two columns of a DataFrame that hold the IDs and summaries of news articles, respectively. I'm running a Python file in Visual Studio Code on Windows 10.
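from gtts import gTTS

# df is the DataFrame with the post_id and summary columns
for row_id, row_summary in zip(df.post_id, df.summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    # raw string so the backslashes in the Windows path are not read as escapes
    tts.save(r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3')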
This works but takes half an hour for 100 summaries, which is slow.
I've tried using pooling like so:
from gtts import gTTS
from multiprocessing import Pool, get_context

def generate_audio(row_id, row_summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    file_name = r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3'
    tts.save(file_name)
    return None

pool_input = list(zip(df.post_id, df.summary))
with get_context("spawn").Pool() as p:
    p.starmap(generate_audio, pool_input)
But I end up getting this error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <module>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <listcomp>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'gtts_summary_4a063dea-9469-4633-87bc-b4a2f2e3f1ba.mp3'
Does this mean it won't run in a pool unless the developer of the gtts library enables it? Or am I simply doing something wrong here?
Edit: Before running this, I also create a folder called summary_audio (if it doesn't already exist) to save the files to.
Comments (1)
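I'm answering my own question because I realized I was running the pool inside a function. The pool statement has to run in the main section of the script, that is, under the if __name__ == "__main__": guard. The traceback shows why: with the spawn start method, each worker re-imports the script, so unguarded top-level code (here the os.remove cleanup on line 100 of cron_audio.py) runs again in every worker. Only with the guard does it work as intended. Here is a full example.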
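The example from the original answer did not survive on this page. What follows is only a minimal sketch of what such a script might look like, not the author's code. It assumes the summaries are read from a CSV file with post_id and summary columns; the helpers audio_path, load_rows and main, the pool size of 4, and the command-line argument are all hypothetical choices made for illustration. Only gTTS(...), tts.save(...) and get_context("spawn").Pool() come from the question.

```python
import csv
import sys
from multiprocessing import get_context
from pathlib import Path

OUT_DIR = Path("summary_audio")  # folder name taken from the question


def audio_path(row_id):
    """Build the output file path for one article ID (hypothetical helper)."""
    return OUT_DIR / f"gtts_summary_{row_id}.mp3"


def generate_audio(row_id, row_summary):
    """Worker: synthesize one summary and save it as an MP3."""
    # Imported here so the rest of the module can be imported
    # even on a machine where gTTS is not installed.
    from gtts import gTTS

    tts = gTTS(row_summary, lang="en", tld="ca")
    tts.save(str(audio_path(row_id)))


def load_rows(csv_path):
    """Read (post_id, summary) pairs from a CSV file (assumed input format)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(r["post_id"], r["summary"]) for r in csv.DictReader(f)]


def main(argv):
    if len(argv) != 1:
        print("usage: python cron_audio.py SUMMARIES_CSV")
        return 1
    rows = load_rows(argv[0])
    OUT_DIR.mkdir(exist_ok=True)
    # The pool is created only in the parent process, never on import.
    with get_context("spawn").Pool(processes=4) as pool:
        pool.starmap(generate_audio, rows)
    return 0


if __name__ == "__main__":
    # Spawned workers re-import this file with a different __name__,
    # so nothing below this guard runs inside them.
    main(sys.argv[1:])
```

Any cleanup code, such as the os.remove loop from the traceback, would also belong inside main() or under the guard so that workers do not repeat it on import.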
Also, it doesn't make much sense to use gTTS with a pool in the first place, because you will get rate limited.
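If some parallelism is still wanted despite the rate limit, one possible approach is a small thread pool combined with retries that wait longer after each failure. The sketch below is an assumption-laden illustration, not something from the original answer: with_backoff and convert_all are hypothetical helpers, the worker count and delays are arbitrary, and catching every Exception is a simplification (a real script would likely catch only the error gTTS raises on a failed request).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_backoff(func, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call func(); after each failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # simplification: retries on any error
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


def convert_all(rows, convert_one, max_workers=2):
    """Run convert_one(row_id, text) for every row on a small thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            # r=row binds the current row to this particular lambda
            pool.submit(with_backoff, lambda r=row: convert_one(*r))
            for row in rows
        ]
        # Results come back in submission order; errors re-raise here.
        return [f.result() for f in futures]
```

With the generate_audio function from the question, the call could look like convert_all(list(zip(df.post_id, df.summary)), generate_audio). Threads suit this workload because each task mostly waits on the network, and keeping max_workers low limits how many requests are in flight at once.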