How to get letter positions relative to audio time in HuggingSound?

Posted 2025-02-06 04:55:00

So I use an STT model (SpeechRecognitionModel). I understand how to get a sentence, but I wonder how to get the corresponding audio timings for the output letters. So how do I get letter positions relative to audio time in HuggingSound?

Comments (1)

萌辣 2025-02-13 04:55:00


The HuggingSound creator here! The transcriptions returned by the transcribe method contain the start/end timestamps, in milliseconds, for each audio file you pass to the method:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/sagan.mp3", "/path/to/asimov.wav"]

transcriptions = model.transcribe(audio_paths)

t = transcriptions[0] # getting the first audio transcription

print(t)

# {
#   "transcription": "extraordinary claims require extraordinary evidence", 
#   "start_timestamps": [100, 120, 140, 180, ...], <- milliseconds
#   "end_timestamps": [120, 140, 180, 200, ...], <- milliseconds
#   "probabilities": [0.95, 0.88, 0.9, 0.97, ...]
# }

start_timestamps[i] gives you the time in milliseconds at which the i-th letter of the transcription starts being detected, and end_timestamps[i] gives the time in milliseconds at which it stops being detected.
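Since each entry lines up index-for-index with a character of the transcription string (as the example output above suggests), you can walk the three lists together. A small sketch, assuming the lists all have the same length as the transcription:

# Pair every character of the transcription with its detection window.
# Assumes len(transcription) == len(start_timestamps) == len(end_timestamps).
for letter, start, end in zip(t["transcription"], t["start_timestamps"], t["end_timestamps"]):
    print(f"{letter!r}: {start}-{end} ms")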

So you can use the start_timestamps and end_timestamps lists to get the timing and even the duration of each letter of an audio file:

# What's the first letter of the audio file?
print(t["transcription"][0]) # e

# When's the detection of the first letter of the audio?
print(t["start_timestamps"][0]) # at 100ms
    
# How long is the first letter of the audio?
print(t["end_timestamps"][0] - t["start_timestamps"][0]) # (120-100) = 20ms long

And conversely, to get the letter position given a time:

# What's the letter at 108ms?
for i, letter in enumerate(t["transcription"]):
    if t["start_timestamps"][i] <= 108 <= t["end_timestamps"][i]:
        print(letter) # e
        break
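Building on the same idea, here is a minimal sketch of deriving word-level timings from the per-letter lists. word_timings is a hypothetical helper, and it assumes the timestamp lists contain one entry per character of the transcription, spaces included; verify that assumption on your own output before relying on it.

# Hypothetical helper: group per-letter timestamps into per-word spans.
# Assumes one timestamp entry per character, spaces included.
def word_timings(t):
    words = []
    start_idx = None
    for i, ch in enumerate(t["transcription"] + " "):  # trailing sentinel space flushes the last word
        if ch != " " and start_idx is None:
            start_idx = i  # a new word starts here
        elif ch == " " and start_idx is not None:
            words.append((
                t["transcription"][start_idx:i],   # the word itself
                t["start_timestamps"][start_idx],  # start of its first letter (ms)
                t["end_timestamps"][i - 1],        # end of its last letter (ms)
            ))
            start_idx = None
    return words

for word, start, end in word_timings(t):
    print(f"{word}: {start}-{end} ms")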