Audio subtitle transcription - C++

I'm working on a project that, among other video-related tasks, should eventually be able to extract the audio from a video, apply some kind of speech recognition to it, and produce a transcript of what's said in the video. Ideally it should output some kind of subtitle format so that the text is linked to a specific point in the video.

I was thinking of using the Microsoft Speech API (aka SAPI). But from what I could see it is rather difficult to use. The very few examples that I found for speech recognition (most are for text-to-speech, which is much easier) didn't perform very well (they don't recognize a thing). For example this one: http://msdn.microsoft.com/en-us/library/ms717071%28v=vs.85%29.aspx

Some examples use something called grammar files, which are supposed to define the words that the recognizer is waiting for, but since I haven't trained the Windows Speech Recognition thoroughly, I think that might be skewing the results.
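
For reference, a minimal sketch of what SAPI 5 dictation mode looks like in C++, using the built-in dictation grammar instead of a custom grammar file, so the recognizer is not limited to a fixed word list. This listens to the default audio input rather than a file, and error handling is stripped out, so treat it as a starting point only:

```cpp
#include <sapi.h>
#include <sphelper.h>   // CSpEvent helper (ships with the SAPI/Windows SDK)
#include <iostream>

int main()
{
    ::CoInitialize(nullptr);

    ISpRecognizer*  recognizer = nullptr;
    ISpRecoContext* context    = nullptr;
    ISpRecoGrammar* grammar    = nullptr;

    // Shared recognizer: uses the engine and user profile configured in Windows.
    ::CoCreateInstance(CLSID_SpSharedRecognizer, nullptr, CLSCTX_ALL,
                       IID_ISpRecognizer, reinterpret_cast<void**>(&recognizer));
    recognizer->CreateRecoContext(&context);
    context->SetNotifyWin32Event();
    context->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

    // Dictation grammar: free-form speech, no grammar file required.
    context->CreateGrammar(0, &grammar);
    grammar->LoadDictation(nullptr, SPLO_STATIC);
    grammar->SetDictationState(SPRS_ACTIVE);

    // Block until one phrase is recognized, then print it.
    context->WaitForNotifyEvent(INFINITE);
    CSpEvent evt;
    if (evt.GetFrom(context) == S_OK && evt.eEventId == SPEI_RECOGNITION) {
        wchar_t* text = nullptr;
        evt.RecoResult()->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE,
                                  TRUE, &text, nullptr);
        std::wcout << text << std::endl;
        ::CoTaskMemFree(text);
    }

    grammar->Release(); context->Release(); recognizer->Release();
    ::CoUninitialize();
    return 0;
}
```

To transcribe an audio file rather than the microphone, the usual approach is an in-process recognizer (CLSID_SpInprocRecognizer) with its input bound to a wav stream via ISpRecognizer::SetInput.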

So my question is... what's the best tool for something like this? Could you provide both paid and free options? I believe the best "free" option (as it comes with Windows) is SAPI; all the rest would presumably be paid, but if they're really good it might be worth it. Also, if you have any good tutorials for using SAPI (or another API) in a context similar to this, that would be great.

Comments (2)

摇划花蜜的午后 2024-12-08 18:31:08

On the whole this is a big ask!

The issue with any speech recognition system is that it functions best after training. It needs context (what words to expect) and some kind of audio benchmark (what each voice sounds like). This might be possible in some cases, such as a TV series, if you wanted to churn through hours of speech (separated for each character) to train it. There's a lot of work there, though. For something like a film there's probably no hope of training a recogniser unless you can get hold of the actors.

Most film and TV production companies just hire media companies to produce the subtitles, either by direct transcription using a human operator or by converting the script. The fact that they still need humans in the loop for these huge operations suggests that automated systems just aren't up to it yet.

In video you have a plethora of things that make your life difficult, pretty much spanning huge swathes of current speech technology research:

-> Multiple speakers -> "Speaker Identification" (can you tell characters apart? Also, subtitles normally have different coloured text for different speakers)

-> Multiple simultaneous speakers -> The "cocktail party problem" - can you separate the two voice components and transcribe both?

-> Background noise -> Can you pick the speech out from any soundtrack/foley/exploding helicopters?

The speech algorithm will need to be extremely robust, as different characters can have different genders/accents/emotions. From what I understand of the current state of recognition, you might be able to handle a single speaker after some training, but asking a single program to nail all of them might be tough!

--

There is no "subtitle" format that I'm aware of. I would suggest saving an image of the text using a font like Tiresias Screenfont that's specifically designed for legibility in these circumstances, and use a lookup table to cross-reference images against video timecode (remembering NTSC/PAL/Cinema use different timing formats).
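
For what it's worth, a minimal sketch of that lookup-table idea, assuming subtitle images keyed by start and end times. The timecode conversion below is a non-drop-frame approximation; real NTSC drop-frame timecode is more involved:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One rendered subtitle image plus the interval it should be shown for.
struct SubtitleEntry {
    std::string imagePath;    // e.g. text rendered in Tiresias Screenfont
    double startSeconds;
    double endSeconds;
};
using SubtitleTable = std::vector<SubtitleEntry>;

// Convert seconds to an HH:MM:SS:FF timecode at the given frame rate
// (NTSC ~29.97 fps, PAL 25 fps, Cinema 24 fps).
std::string toTimecode(double seconds, double fps)
{
    const int frames = static_cast<int>(seconds * fps) % static_cast<int>(fps);
    const int s = static_cast<int>(seconds);
    char buf[16];
    std::snprintf(buf, sizeof(buf), "%02d:%02d:%02d:%02d",
                  s / 3600, (s / 60) % 60, s % 60, frames);
    return buf;
}
```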

--

There's a bunch of proprietary speech recognition systems out there. If you want the best you'll probably want to license a solution off one of the big boys like Nuance. If you want to keep things free the universities of RWTH and CMU have put some solutions together. I have no idea how good they are or how well they might be suited to the problem.

--

The only solution I can think of similar to what you're aiming at is the subtitling you can get on news channels here in the UK "Live Closed Captioning". Since it's live, I assume they use some kind of speech recognition system trained to the reader (although it might not be trained, I'm not sure). It's got better over the past few years, but on the whole it's still pretty poor. The biggest thing it seems to struggle with is speed. Dialogue is normally really fast, so live subtitling has the extra issue of getting everything done in time. Live closed captions quite frequently get left behind and have to miss a lot of content out to catch up.

Whether you have to deal with this depends on whether you'll be subtitling "live" video or if you can pre-process it. To deal with all the additional complications above I assume you'll need to pre-process it.

--

As much as I hate citing the big W, there's a goldmine of useful links here!

Good luck :)

撩心不撩汉 2024-12-08 18:31:08

This falls into the category of dictation, which is a very large vocabulary task. Products like Dragon NaturallySpeaking are amazingly good, and it has a SAPI interface for developers. But it's not such a simple problem.

Normally a dictation product is meant to be single speaker and the best products adapt automatically to that speaker, thereby improving the underlying acoustic model. They also have sophisticated language modeling which serves to constrain the problem at any given moment by limiting what is known as the perplexity of the vocabulary. That's a fancy way of saying the system is figuring out what you're talking about and therefore what types of words and phrases are likely or not likely to come next.
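
For reference, perplexity has a standard definition: for a word sequence $w_1, \ldots, w_N$,

$$\mathrm{PP}(W) = P(w_1, \ldots, w_N)^{-1/N} = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\ln P(w_i \mid w_1, \ldots, w_{i-1})\Big),$$

so lower perplexity means the language model assigns higher probability to what was actually said, leaving the recognizer fewer plausible alternatives to weigh at each step.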

It would be interesting though to apply a really good dictation system to your recordings and see how well it does. My suggestion for a paid system would be to get Dragon NaturallySpeaking from Nuance and get the developer API. I believe that provides a SAPI interface, which has the benefit of allowing you to swap in the Microsoft speech engine or any other ASR engine that supports SAPI. IBM would be another vendor to look at, but I don't think you will do much better than Dragon.

But it won't work well! After all the work of integrating the ASR engine, what you will probably find is that you get a pretty high error rate (maybe half). That would be due to a few major challenges in this task:

1) multiple speakers, which will degrade the acoustic model and adaptation.
2) background music and sound effects.
3) mixed speech - people talking over each other.
4) lack of a good language model for the task.

For 1) if you had a way of separating each actor onto a separate track, that would be ideal. But there's no reliable way of separating speakers automatically that would be good enough for a speech recognizer. If each speaker were at a distinctly different pitch, you could try pitch detection (there's some free software out there for that) and separate based on that, but this is a sophisticated and error-prone task; a sketch of the basic idea follows. The best thing would be hand-editing the speakers apart, but you might as well just manually transcribe the speech at that point! If you could get the actors on separate tracks, you would need to run the ASR using different user profiles.
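
A minimal sketch of autocorrelation-based pitch estimation, which is the usual starting point; it assumes mono samples normalized to [-1, 1], and a real system would add windowing, voicing detection, and more robust peak picking:

```cpp
#include <cstddef>
#include <vector>

// Estimate the fundamental frequency (Hz) of one frame of speech,
// or return 0 if the frame is too short. Searches lags covering the
// typical speech F0 range of roughly 60-400 Hz.
double estimatePitchHz(const std::vector<double>& frame, double sampleRate)
{
    const std::size_t minLag = static_cast<std::size_t>(sampleRate / 400.0);
    const std::size_t maxLag = static_cast<std::size_t>(sampleRate / 60.0);
    if (frame.size() <= maxLag || minLag == 0) return 0.0;

    double bestCorr = 0.0;
    std::size_t bestLag = 0;
    for (std::size_t lag = minLag; lag <= maxLag; ++lag) {
        double corr = 0.0;
        for (std::size_t i = 0; i + lag < frame.size(); ++i)
            corr += frame[i] * frame[i + lag];   // autocorrelation at this lag
        if (corr > bestCorr) { bestCorr = corr; bestLag = lag; }
    }
    return bestLag ? sampleRate / static_cast<double>(bestLag) : 0.0;
}
```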

For music (2) you'd either have to hope for the best or try to filter it out. Speech is more band-limited than music, so you could try a band-pass filter that attenuates everything except the voice band. You would want to experiment with the cutoffs, but I would guess 100 Hz to 2-3 kHz would keep the speech intelligible. A sketch of one way to do that follows.
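
One conventional way to build such a filter is a biquad using the RBJ audio EQ cookbook coefficients; the centre frequency and Q below are assumptions to experiment with:

```cpp
#include <cmath>

// Second-order (biquad) band-pass filter, direct form I.
struct Biquad {
    double b0, b1, b2, a1, a2;              // normalized coefficients
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;  // filter state

    double process(double x) {
        const double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;
        y2 = y1; y1 = y;
        return y;
    }
};

// Band-pass with 0 dB peak gain at centreHz (RBJ cookbook formulas).
Biquad makeBandPass(double sampleRate, double centreHz, double q)
{
    const double kPi = 3.14159265358979323846;
    const double w0 = 2.0 * kPi * centreHz / sampleRate;
    const double alpha = std::sin(w0) / (2.0 * q);
    const double a0 = 1.0 + alpha;
    Biquad f{};
    f.b0 =  alpha / a0;
    f.b1 =  0.0;
    f.b2 = -alpha / a0;
    f.a1 = -2.0 * std::cos(w0) / a0;
    f.a2 = (1.0 - alpha) / a0;
    return f;
}
```

Run every sample through process(); e.g. a filter from makeBandPass(44100.0, 1000.0, 0.7) passes a broad band around 1 kHz. Cascading a high-pass near 100 Hz with a low-pass near 3 kHz would give steeper edges than a single band-pass.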

For (3), there's no solution. The ASR engine should return confidence scores, so at best, if you can tag low-scoring segments, you can go back and manually transcribe those bits of speech, e.g. along these lines.
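
An engine-agnostic sketch of that triage step; the segment fields here are hypothetical stand-ins for whatever the ASR engine actually returns:

```cpp
#include <string>
#include <vector>

// One recognized chunk of speech as reported by the ASR engine.
struct RecognizedSegment {
    std::string text;
    double confidence;    // hypothetical: engine-reported score in [0, 1]
    double startSeconds;
    double endSeconds;
};

// Collect the segments that scored too low and need manual transcription.
std::vector<RecognizedSegment> flagForReview(
    const std::vector<RecognizedSegment>& segments, double threshold = 0.5)
{
    std::vector<RecognizedSegment> flagged;
    for (const auto& s : segments)
        if (s.confidence < threshold)
            flagged.push_back(s);
    return flagged;
}
```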

(4) is a sophisticated task for a speech scientist. Your best bet would be to search for an existing language model made for the topic of the movie. Talk to Nuance or IBM, actually. Maybe they could point you in the right direction.

Hope this helps.
