Open source code for voice detection and discrimination
I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it. The location may not be at the start of one of the tapes. Most of the content seems to fall into three categories -- in order of total length, longest first: silence, speech radio, and music.
I plan to convert all of the tapes to digital format, and then look again for the recording. The obvious way is to play them all in the background while I'm doing other things. That's far too straightforward for me, so: Are there any open source libraries, or other code, that would allow me to find, in order of increasing sophistication and usefulness:
- Non-silent regions
- Regions containing human speech
- Regions containing my own speech (and that of my grandmother)
My preference is for Python, Java, or C.
Failing answers, hints about search terms would be appreciated since I know nothing about the field.
I understand that I could easily spend more than 20 hours on this.
Comments (8)
I wrote a blog article a while ago about using Windows speech recognition, with a basic tutorial on converting audio files to text in C#. You can check it out here.
I'd start here,
http://alize.univ-avignon.fr/
http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start
Code::Blocks is good for gcc.
Try Audacity + view track as spectrogram (log f) + train your eyes(!) to recognize speech. You will need to tune the time scale and FFT window.
What will probably save you most of the time is speaker diarization. This works by annotating the recording with speaker IDs, which you can then manually map to real people with very little effort. Error rates are typically about 10-15% of the recording length, which sounds awful, but this includes detecting too many speakers and mapping two IDs to the same person, which isn't that hard to mend.
One such good tool is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage notes for this tool. It outputs voice/speech activity detection metadata AND speaker diarization, meaning you get the 1st and 2nd points (VAD/SAD) and a bit extra, since it annotates when the same speaker is active in a recording.
The other useful tool is LIUM spkdiarization (Java), which does basically the same thing, except I haven't put in enough effort yet to figure out how to get the VAD metadata. It features a nice ready-to-use downloadable package.
With a little bit of compiling, this should work in under an hour.
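If it helps, here is a rough sketch of driving the LIUM package from Python and reading back the speaker segments. The jar name, flags, and .seg column layout follow my memory of LIUM's quick-start page and may differ in your version, so treat them as assumptions:

import subprocess

show = "tape01"  # hypothetical base name: one digitized tape stored as tape01.wav

# Invocation modeled on LIUM's quick-start; the jar file name is an assumption.
subprocess.run([
    "java", "-Xmx2048m", "-jar", "lium_spkdiarization.jar",
    "--fInputMask=./%s.wav", "--sOutputMask=./%s.seg",
    "--doCEClustering", show,
], check=True)

# Parse the .seg output. Assumed columns (per the quick-start):
# show channel start length gender band environment speaker,
# with start/length counted in 10 ms frames; lines starting ";;" are comments.
for line in open(show + ".seg"):
    if line.startswith(";;") or not line.split():
        continue
    fields = line.split()
    start = int(fields[2]) / 100.0
    length = int(fields[3]) / 100.0
    print("%s: %7.1fs - %7.1fs" % (fields[7], start, start + length))

The speaker labels (S0, S1, ...) can then be mapped to real people by listening to one short segment per label.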
The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification is used to identify a particular speaker whereas speech recognition is converting spoken audio to text. There may be open source speaker identification packages, you could try searching something like SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself I can't recommend anything.
If you can't find anything but you are interested in rolling one of your own, then there are plenty of open source FFT libraries for any popular language. The technique would be:
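One plausible shape for that technique, as a minimal sketch assuming NumPy (every function name and threshold here is illustrative, not from an existing package): take the average FFT magnitude spectrum of a short clip known to contain your voice, then slide over each tape and flag windows whose spectrum correlates well with it.

import numpy as np

def avg_spectrum(samples, frame=2048):
    # Average magnitude spectrum over all complete frames, normalized to unit length.
    n = len(samples) // frame
    frames = samples[: n * frame].reshape(n, frame)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1)).mean(axis=0)
    return spec / (np.linalg.norm(spec) + 1e-12)

def find_candidates(reference, tape, rate, window_s=5.0, threshold=0.9):
    # reference: a few seconds known to be your voice; tape: one digitized tape.
    # Both are mono float arrays at the same sample rate.
    ref = avg_spectrum(reference)
    step = int(window_s * rate)
    for start in range(0, len(tape) - step, step):
        sim = float(np.dot(ref, avg_spectrum(tape[start:start + step])))
        if sim > threshold:  # cosine similarity; the cutoff needs experimenting
            yield start / rate, sim

This is far cruder than real speaker identification, which typically uses MFCC features and statistical models, but it shows where an FFT library fits into the pipeline.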
Note, the number of hours to complete this project could easily exceed the 20 hours of listening to the recordings manually. But it will be a lot more fun than grinding through 20 hours of audio and you can use the software you build again in the future.
Of course, if the audio is not sensitive from a privacy viewpoint, you could outsource the audio auditioning task to something like Amazon's Mechanical Turk.
You could also try pyAudioAnalysis to:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Read the WAV file: Fs is the sampling rate, x the raw signal.
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# Detect non-silent chunks using 20 ms windows; smoothWindow (seconds) and
# Weight (0-1) trade off how aggressively silence is stripped.
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow = 1.0, Weight = 0.3, plot = True)
segments contains the endpoints of the non-silence segments.
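For instance, assuming silenceRemoval returns [start, end] pairs in seconds (as the pyAudioAnalysis docs of that era describe), you can print them to know where to listen:

for start, end in segments:
    print("non-silence from %.1fs to %.1fs" % (start, end))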
If you are familiar with Java, you could try to feed the audio files through minim and calculate some FFT spectra. Silence can be detected by defining a minimum level for the amplitude of the samples (to rule out noise). To separate speech from music, the FFT spectrum of a time window can be used: speech uses some very distinct frequency bands called formants, especially for vowels, whereas music is more evenly distributed across the frequency spectrum.
You probably won't get a 100% separation of the speech/music blocks, but it should be good enough to tag the files and only listen to the interesting parts.
http://code.compartmental.net/tools/minim/
http://en.wikipedia.org/wiki/Formant
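A minimal sketch of that formant idea in Python with NumPy instead of minim (the band edges and any cutoff are illustrative and would need tuning): measure what fraction of a window's spectral energy falls in a rough formant band.

import numpy as np

def speech_likeness(window, rate):
    # Fraction of spectral energy in a rough formant band (~300-3400 Hz).
    spec = np.abs(np.fft.rfft(window * np.hanning(len(window)))) ** 2
    freqs = np.fft.rfftfreq(len(window), 1.0 / rate)
    band = (freqs > 300) & (freqs < 3400)
    return spec[band].sum() / (spec.sum() + 1e-12)

# Slide this over each tape in, say, one-second windows: values near 1 suggest
# speech, lower values suggest music; the exact cutoff has to be found by ear.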
Two ideas: