How to get letter positions relative to audio time in HuggingSound?

Posted 2025-02-06 04:55:00

So I use an STT model (SpeechRecognitionModel). I understand how to get a sentence, but I wonder how to get the corresponding audio timings for the output letters. So how do I get letter positions relative to audio time in HuggingSound?

Comments (1)

萌辣 2025-02-13 04:55:00


The HuggingSound creator here! The transcriptions returned by the transcribe method contain the start/end timestamps, in milliseconds, for each audio file you pass to the method:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/sagan.mp3", "/path/to/asimov.wav"]

transcriptions = model.transcribe(audio_paths)

t = transcriptions[0] # getting the first audio transcription

print(t)

# {
#   "transcription": "extraordinary claims require extraordinary evidence", 
#   "start_timestamps": [100, 120, 140, 180, ...], <- milliseconds
#   "end_timestamps": [120, 140, 180, 200, ...], <- milliseconds
#   "probabilities": [0.95, 0.88, 0.9, 0.97, ...]
# }

start_timestamps[i] gives you the time in milliseconds at which the i-th letter of the transcription starts being detected, and end_timestamps[i] gives the time in milliseconds at which it stops being detected.
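Since each entry lines up index-for-index with a character of the transcription string (as the example output above suggests), you can walk the three lists together. A small sketch, assuming the lists all have the same length as the transcription:

# Pair every character of the transcription with its detection window.
# Assumes len(transcription) == len(start_timestamps) == len(end_timestamps).
for letter, start, end in zip(t["transcription"], t["start_timestamps"], t["end_timestamps"]):
    print(f"{letter!r}: {start}-{end} ms")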

So you can use the start_timestamps and end_timestamps lists to get the timing and even the duration of each letter of an audio file:

# What's the first letter of the audio file?
print(t["transcription"][0]) # e

# When's the detection of the first letter of the audio?
print(t["start_timestamps"][0]) # at 100ms
    
# How long is the first letter of the audio?
print(t["end_timestamps"][0] - t["start_timestamps"][0]) # (120-100) = 20ms long

And conversely, to get the letter position given a time:

# What's the letter at 108ms?
for i, letter in enumerate(t["transcription"]):
    if t["start_timestamps"][i] <= 108 <= t["end_timestamps"][i]:
        print(letter) # e
        break
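Building on the same idea, here is a minimal sketch of deriving word-level timings from the per-letter lists. word_timings is a hypothetical helper, and it assumes the timestamp lists contain one entry per character of the transcription, spaces included; verify that assumption on your own output before relying on it.

# Hypothetical helper: group per-letter timestamps into per-word spans.
# Assumes one timestamp entry per character, spaces included.
def word_timings(t):
    words = []
    start_idx = None
    for i, ch in enumerate(t["transcription"] + " "):  # trailing sentinel space flushes the last word
        if ch != " " and start_idx is None:
            start_idx = i  # a new word starts here
        elif ch == " " and start_idx is not None:
            words.append((
                t["transcription"][start_idx:i],   # the word itself
                t["start_timestamps"][start_idx],  # start of its first letter (ms)
                t["end_timestamps"][i - 1],        # end of its last letter (ms)
            ))
            start_idx = None
    return words

for word, start, end in word_timings(t):
    print(f"{word}: {start}-{end} ms")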