Open source code for voice detection and discrimination
I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it. The location may not be at the start of one of the tapes. Most of the content seems to fall into three categories -- in order of total length, longest first: silence, speech radio, and music.
I plan to convert all of the tapes to digital format, and then look again for the recording. The obvious way is to play them all in the background while I'm doing other things. That's far too straightforward for me, so: Are there any open source libraries, or other code, that would allow me to find, in order of increasing sophistication and usefulness:
- Non-silent regions
- Regions containing human speech
- Regions containing my own speech (and that of my grandmother)
My preference is for Python, Java, or C.
Failing answers, hints about search terms would be appreciated since I know nothing about the field.
I understand that I could easily spend more than 20 hours on this.
Comments (8)
I wrote a blog article a while ago about using Windows speech recognition, with a basic tutorial on converting audio files to text in C#. You can check it out here.
I'd start here,
http://alize.univ-avignon.fr/
http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start
Code::Blocks is good for gcc.
Try Audacity + view track as spectrogram (log f) + train your eyes(!) to recognize speech. You will need to tune the time scale and FFT window.
What will probably save you most of the time is speaker diarization. This works by annotating the recording with speaker IDs, which you can then manually map to real people with very little effort. Error rates are typically about 10-15% of the recording length, which sounds awful, but this includes detecting too many speakers and mapping two IDs to the same person, which isn't that hard to mend.
One such good tool is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage notes for this tool. It outputs voice/speech activity detection metadata AND speaker diarization, meaning you get the 1st and 2nd points (VAD/SAD) and a bit extra, since it annotates when the same speaker is active in a recording.
The other useful tool is LIUM spkdiarization (Java), which does basically the same thing, except I haven't put in enough effort yet to figure out how to get the VAD metadata. It features a nice ready-to-use downloadable package.
With a little bit of compiling, this should work in under an hour.
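If it helps, here is a rough sketch of driving the LIUM package from Python and reading back the speaker segments. The jar name, flags, and .seg column layout follow my memory of LIUM's quick-start page and may differ in your version, so treat them as assumptions:

import subprocess

show = "tape01"  # hypothetical base name: one digitized tape stored as tape01.wav

# Invocation modeled on LIUM's quick-start; the jar file name is an assumption.
subprocess.run([
    "java", "-Xmx2048m", "-jar", "lium_spkdiarization.jar",
    "--fInputMask=./%s.wav", "--sOutputMask=./%s.seg",
    "--doCEClustering", show,
], check=True)

# Parse the .seg output. Assumed columns (per the quick-start):
# show channel start length gender band environment speaker,
# with start/length counted in 10 ms frames; lines starting ";;" are comments.
for line in open(show + ".seg"):
    if line.startswith(";;") or not line.split():
        continue
    fields = line.split()
    start = int(fields[2]) / 100.0
    length = int(fields[3]) / 100.0
    print("%s: %7.1fs - %7.1fs" % (fields[7], start, start + length))

The speaker labels (S0, S1, ...) can then be mapped to real people by listening to one short segment per label.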
The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification is used to identify a particular speaker whereas speech recognition is converting spoken audio to text. There may be open source speaker identification packages, you could try searching something like SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself I can't recommend anything.
If you can't find anything but you are interested in rolling one of your own, then there are plenty of open source FFT libraries for any popular language. The technique would be:
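One plausible shape for that technique, as a minimal sketch assuming NumPy (every function name and threshold here is illustrative, not from an existing package): take the average FFT magnitude spectrum of a short clip known to contain your voice, then slide over each tape and flag windows whose spectrum correlates well with it.

import numpy as np

def avg_spectrum(samples, frame=2048):
    # Average magnitude spectrum over all complete frames, normalized to unit length.
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)).mean(axis=0)
    return spec / (np.linalg.norm(spec) + 1e-12)

def find_candidates(reference, tape, rate, window_s=5.0, threshold=0.9):
    # reference: a few seconds known to be your voice; tape: one digitized tape.
    # Both are mono float arrays at the same sample rate.
    ref = avg_spectrum(reference)
    step = int(window_s * rate)
    for start in range(0, len(tape) - step, step):
        sim = float(np.dot(ref, avg_spectrum(tape[start:start + step])))
        if sim > threshold:  # cosine similarity; the cutoff needs experimenting
            yield start / rate, sim

This is far cruder than real speaker identification, which typically uses MFCC features and statistical models, but it shows where an FFT library fits into the pipeline.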
Note, the number of hours to complete this project could easily exceed the 20 hours of listening to the recordings manually. But it will be a lot more fun than grinding through 20 hours of audio and you can use the software you build again in the future.
Of course, if the audio is not sensitive from a privacy viewpoint, you could outsource the audio auditioning task to something like Amazon's Mechanical Turk.
You could also try pyAudioAnalysis to:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Read the WAV file: Fs is the sampling rate, x the raw signal.
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# Detect non-silent chunks using 20 ms windows; smoothWindow (seconds) and
# Weight (0-1) trade off how aggressively silence is stripped.
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow = 1.0, Weight = 0.3, plot = True)
segments contains the endpoints of the non-silence segments.
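For instance, assuming silenceRemoval returns [start, end] pairs in seconds (as the pyAudioAnalysis docs of that era describe), you can print them to know where to listen:

for start, end in segments:
    print("non-silence from %.1fs to %.1fs" % (start, end))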
If you are familiar with Java, you could try to feed the audio files through minim and calculate some FFT spectra. Silence can be detected by defining a minimum level for the amplitude of the samples (to rule out noise). To separate speech from music, the FFT spectrum of a time window can be used: speech uses some very distinct frequency bands called formants, especially for vowels, whereas music is more evenly distributed across the frequency spectrum.
You probably won't get a 100% separation of the speech/music blocks, but it should be good enough to tag the files and only listen to the interesting parts.
http://code.compartmental.net/tools/minim/
http://en.wikipedia.org/wiki/Formant
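A minimal sketch of that formant idea in Python with NumPy instead of minim (the band edges and any cutoff are illustrative and would need tuning): measure what fraction of a window's spectral energy falls in a rough formant band.

import numpy as np

def speech_likeness(window, rate):
    # Fraction of spectral energy in a rough formant band (~300-3400 Hz).
    spec = np.abs(np.fft.rfft(window * np.hanning(len(window)))) ** 2
    freqs = np.fft.rfftfreq(len(window), 1.0 / rate)
    band = (freqs > 300) & (freqs < 3400)
    return spec[band].sum() / (spec.sum() + 1e-12)

# Slide this over each tape in, say, one-second windows: values near 1 suggest
# speech, lower values suggest music; the exact cutoff has to be found by ear.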
Two ideas: