How do I run a multiprocessing pool with gTTS in Python?

Posted on 2025-01-31 21:37:23

I'm trying to do bulk text-to-audio conversion without using cloud services like AWS Polly.

gTTS gives good-quality text-to-speech but requires an internet connection to get results. Running TTS on individual strings, for example with the code below, is very slow.

post_id and summary are two columns of a dataframe that hold the IDs and summaries of news articles, respectively. I'm running a Python file in Visual Studio Code on Windows 10.

from gtts import gTTS

for row_id, row_summary in zip(df.post_id, df.summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    # Raw string so the backslashes in the Windows path are taken literally.
    tts.save(r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3')

This works but takes half an hour for 100 summaries, which is slow.

I've tried using pooling like so:

from gtts import gTTS
from multiprocessing import Pool, get_context

def generate_audio(row_id, row_summary):
    tts = gTTS(row_summary, lang='en', tld='ca')
    file_name = r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3'
    tts.save(file_name)
    return None

pool_input = list(zip(df.post_id, df.summary))
with get_context("spawn").Pool() as p:
    p.starmap(generate_audio, pool_input)

But I end up getting this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path, 
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <module>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
  File "c:\Users\Admin\Desktop\Projects\NLP_Cron\cron_audio.py", line 100, in <listcomp>
    [os.remove(f) for f in os.listdir() if f.endswith(".mp3")]
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'gtts_summary_4a063dea-9469-4633-87bc-b4a2f2e3f1ba.mp3'

Does this mean it won't run in a pool unless the developer of the gtts library enables it? Or am I simply doing something wrong here?

Edit: Before doing this, I also create a folder called summary_audio to save the files to, if it doesn't already exist.
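The folder setup mentioned in the edit can be sketched as follows. This is a minimal illustration, not the asker's actual code: `audio_path` is a hypothetical helper name, and the `gtts_summary_<id>.mp3` naming simply mirrors the file names used above. Building the path with `pathlib` also avoids hand-written backslashes.

```python
from pathlib import Path


def audio_path(row_id, out_dir="summary_audio"):
    """Return the MP3 path for one summary, creating the folder if needed.

    Hypothetical helper for illustration; the name is not from the original post.
    """
    folder = Path(out_dir)
    # exist_ok=True makes this safe to call repeatedly, also from several workers.
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"gtts_summary_{row_id}.mp3"
```

With this in place, the save call would look something like `tts.save(str(audio_path(row_id)))`.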

Comments (1)

白昼 2025-02-07 21:37:23

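I'm answering my own question because I realized I was running the pool inside a function. The pool has to be created in the main section, under an if __name__ == '__main__': guard. On Windows, worker processes are started with the spawn method, which re-imports the main script, so any unguarded top-level code (such as the os.remove line in the traceback) runs again in every worker. Only with the guard does it work as intended. Here is a full example.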

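import os
from multiprocessing import Pool

import pandas as pd
from gtts import gTTS

# One id per summary; zip() would silently drop any unmatched item.
df_col1 = [1, 3, 4, 5, 6, 7, 8]
df_col2 = ["Hi", "Bye", "Why?", "Cry", "Die", "Pie", "Shy"]

df = pd.DataFrame(zip(df_col1, df_col2), columns=['post_id', 'summary'])

def generate_audio(row_id, row_summary):
    # Runs in a worker process: request the speech audio and save it as an MP3.
    tts = gTTS(row_summary, lang='en', tld='ca')
    file_name = r'.\summary_audio\gtts_summary_' + str(row_id) + '.mp3'
    tts.save(file_name)

if __name__ == '__main__':
    # Only the parent process runs this block; spawned workers skip it on import.
    os.makedirs('summary_audio', exist_ok=True)
    pool_input = list(zip(df.post_id, df.summary))
    with Pool(3) as p:
        # starmap blocks until every task has finished, so no join() is needed.
        p.starmap(generate_audio, pool_input)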

Also, it doesn't make much sense to use gTTS with pooling in the first place, because you will get rate limited.
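If some parallelism is still wanted despite the rate limit, one possible approach is a small thread pool with retries. The sketch below is an assumption-laden illustration, not something from the original answer: `with_retries` and `convert_all` are hypothetical names, the worker count and delays are arbitrary, and retrying does not guarantee the service will accept the requests. Threads are used here because the work is network-bound, and unlike spawned processes they do not re-import the main script.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(func, attempts=3, base_delay=1.0):
    """Wrap func so a failed call is retried with exponential backoff."""
    def wrapper(*args):
        for attempt in range(attempts):
            try:
                return func(*args)
            except Exception:
                # Give up after the last attempt and let the caller see the error.
                if attempt == attempts - 1:
                    raise
                # Wait base_delay, then 2x, 4x, ... before trying again.
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


def convert_all(pairs, synthesize, max_workers=2, attempts=3, base_delay=1.0):
    """Call synthesize(row_id, text) for every pair on a small thread pool.

    synthesize is supplied by the caller (hypothetical: for example a function
    that wraps the gTTS call and save). Results come back in input order.
    """
    safe = with_retries(synthesize, attempts=attempts, base_delay=base_delay)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(safe, row_id, text) for row_id, text in pairs]
        return [future.result() for future in futures]
```

A call might look like `convert_all(zip(df.post_id, df.summary), generate_audio)`, reusing the `generate_audio` function from the example above. Keeping `max_workers` low is deliberate: more workers mean more simultaneous requests, which makes rate limiting more likely.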
