Open source code for voice detection and discrimination

Posted 2024-11-02 18:52:15 · 414 words · 8 views · 0 comments

I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it. The location may not be at the start of one of the tapes. Most of the content seems to fall into three categories -- in order of total length, longest first: silence, speech radio, and music.

I plan to convert all of the tapes to digital format, and then look again for the recording. The obvious way is to play them all in the background while I'm doing other things. That's far too straightforward for me, so: Are there any open source libraries, or other code, that would allow me to find, in order of increasing sophistication and usefulness:

  1. Non-silent regions
  2. Regions containing human speech
  3. Regions containing my own speech (and that of my grandmother)

My preference is for Python, Java, or C.

Failing answers, hints about search terms would be appreciated since I know nothing about the field.

I understand that I could easily spend more than 20 hours on this.

Comments (8)

空‖城人不在 2024-11-09 18:52:16

I wrote a blog article a while ago about using Windows speech recognition. It includes a basic tutorial on converting audio files to text in C#. You can check it out here.

忘年祭陌 2024-11-09 18:52:16

Try Audacity: view the track as a spectrogram (log(f) scale) and train your eyes (!) to recognize speech. You will need to tune the time scale and the FFT window.

婴鹅 2024-11-09 18:52:15

What will probably save you the most time is speaker diarization. It annotates the recording with speaker IDs, which you can then map to real people manually with very little effort. Error rates are typically about 10-15% of the recording length, which sounds awful, but that figure includes detecting too many speakers and mapping two IDs to the same person, which isn't hard to fix.

One good tool for this is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage instructions for the tool. It outputs voice/speech activity detection (VAD/SAD) metadata and speaker diarization, so you get your 1st and 2nd points plus a bit extra, since it also annotates when the same speaker is active in a recording.

The other useful tool is LIUM spkdiarization (Java), which basically does the same, except that I haven't yet put in enough effort to figure out how to get VAD metadata from it. It comes as a nice ready-to-use downloadable package.

With a little bit of compiling, this should work in under an hour.

独自唱情﹋歌 2024-11-09 18:52:15

The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification identifies a particular speaker, whereas speech recognition converts spoken audio to text. There may be open source speaker identification packages; you could try searching a site such as SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself, I can't recommend anything.

If you can't find anything but you are interested in rolling one of your own, then there are plenty of open source FFT libraries for any popular language. The technique would be:

  • Get a typical recording, in digital form, of you talking normally and your grandmother talking normally, with as little background noise as possible
  • Take the FFT of every second or so of audio in that reference recording
  • From the array of FFT profiles you have created, filter out any below a certain average energy threshold, since they are most likely noise
  • Build a master FFT profile by averaging the remaining (non-filtered) FFT snapshots
  • Then repeat the FFT sampling technique above on the digitized target audio (the 20 hours of stuff)
  • Flag any areas in the target audio files where the FFT snapshot at a given time index is similar to your master FFT profile for you and your grandmother talking. You will need to play with the similarity setting so that you don't get too many false positives. Also note that you may have to limit your FFT frequency bin comparison to only those bins that have energy in your master FFT profile. Otherwise, if the target audio of you and your grandmother talking contains significant background noise, it will throw off your similarity function.
  • Crank out a list of time indices for manual inspection
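
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the function names, the cosine similarity measure, the thresholds, and the synthetic sine-wave "recordings" are all made up for the example, and the small pure-Python FFT stands in for whatever FFT library you would actually use. Real audio would need longer frames and tuned thresholds.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def spectra(samples, size):
    """Magnitude spectrum (positive-frequency bins) of each consecutive frame."""
    result = []
    for i in range(0, len(samples) - size + 1, size):
        result.append([abs(c) for c in fft(samples[i:i + size])[: size // 2]])
    return result

def mean_level(mag, bins):
    """Average magnitude over the given bin indices."""
    return sum(mag[i] for i in bins) / len(bins)

def build_profile(reference, size, energy_floor):
    """Average the spectra of all reference frames that are loud enough."""
    all_bins = range(size // 2)
    kept = [m for m in spectra(reference, size) if mean_level(m, all_bins) >= energy_floor]
    if not kept:
        raise ValueError("no reference frame is above the energy floor")
    return [sum(col) / len(kept) for col in zip(*kept)]

def flag_frames(target, profile, size, rate, energy_floor, bin_floor, threshold):
    """Start times (seconds) of target frames whose spectrum resembles the profile.

    Only bins where the profile has energy (>= bin_floor) are compared.
    """
    bins = [i for i, p in enumerate(profile) if p >= bin_floor]
    if not bins:
        raise ValueError("profile has no bin above bin_floor")
    profile_norm = math.sqrt(sum(profile[i] ** 2 for i in bins))
    hits = []
    for idx, mag in enumerate(spectra(target, size)):
        norm = math.sqrt(sum(mag[i] ** 2 for i in bins))
        if norm == 0 or mean_level(mag, bins) < energy_floor:
            continue  # too quiet in the bands of interest
        cosine = sum(mag[i] * profile[i] for i in bins) / (norm * profile_norm)
        if cosine >= threshold:
            hits.append(idx * size / rate)
    return hits

def tone(parts, n, rate):
    """Synthetic test signal: sum of (frequency_hz, amplitude) sine waves."""
    return [sum(a * math.sin(2 * math.pi * f * t / rate) for f, a in parts) for t in range(n)]

# Hypothetical demo data: sine mixtures stand in for real recordings.
RATE, SIZE = 8000, 256
voice = [(250.0, 1.0), (500.0, 0.5)]   # stand-in for the reference speakers
music = [(1250.0, 1.0)]                # stand-in for unrelated content
profile = build_profile(tone(voice, 4 * SIZE, RATE), SIZE, energy_floor=0.5)
target = ([0.0] * (2 * SIZE) + tone(music, 2 * SIZE, RATE)
          + tone(voice, 2 * SIZE, RATE) + tone(voice + music, SIZE, RATE))
print(flag_frames(target, profile, SIZE, RATE, energy_floor=0.5, bin_floor=1.0, threshold=0.9))
```

In this toy run the silent and music-only frames are skipped, while the frames containing the reference tones are flagged, including the last one where "music" is mixed in, because only the profile's own bins are compared.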

Note, the number of hours to complete this project could easily exceed the 20 hours of listening to the recordings manually. But it will be a lot more fun than grinding through 20 hours of audio and you can use the software you build again in the future.

Of course, if the audio is not sensitive from a privacy viewpoint, you could outsource the listening task to something like Amazon Mechanical Turk.

︶葆Ⅱㄣ 2024-11-09 18:52:15

You could also try pyAudioAnalysis to:

  1. Silence removal:

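# These calls use the older camelCase pyAudioAnalysis API; newer releases
# appear to rename them to read_audio_file() and silence_removal().
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS
# Fs is the sampling rate in Hz, x holds the signal samples
[Fs, x] = aIO.readAudioFile("data/recording1.wav")
# 20 ms analysis window and step; plot=True also displays the detected segments
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow=1.0, Weight=0.3, plot=True)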

segments contains the endpoints of the non-silence segments.

  2. Classification (speech vs. music discrimination): pyAudioAnalysis also includes pretrained classifiers, which can be used to classify unknown segments as either speech or music.
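
Once you have segment endpoints like those returned by the silence removal step, a small helper can turn them into a list of tape positions to check by hand. The sketch below assumes the segments are [start, end] pairs in seconds; the function names and the 2-second gap are hypothetical choices for illustration, not part of pyAudioAnalysis.

```python
def merge_segments(segments, max_gap=2.0):
    """Join segments separated by at most max_gap seconds of silence."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def format_time(seconds):
    """Format a position in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"

# Hypothetical endpoints, in seconds, standing in for real detector output.
example = [[0.1, 0.9], [1.4, 3.0], [3700.0, 3725.5]]
for start, end in merge_segments(example):
    print(format_time(start), "-", format_time(end))
```

Merging short pauses keeps a single conversation from being split into dozens of tiny entries.
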
他不在意 2024-11-09 18:52:15

If you are familiar with Java, you could try feeding the audio files through Minim and calculating some FFT spectra. Silence can be detected by defining a minimum level for the amplitude of the samples (to rule out noise). To separate speech from music, you can use the FFT spectrum of a time window. Speech uses some very distinct frequency bands called formants, especially for vowels, whereas music is more evenly distributed across the frequency spectrum.

You probably won't get a 100% separation of the speech/music blocks, but it should be good enough to tag the files and listen only to the interesting parts.

http://code.compartmental.net/tools/minim/

http://en.wikipedia.org/wiki/Formant
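
The amplitude-threshold idea for silence can be sketched in a few lines. This example is in Python rather than Java/Minim and covers only the silence part, not the formant-based speech/music separation. The frame length, the threshold, and the function names are assumptions chosen for illustration, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def frame_rms(samples, size):
    """RMS level of each consecutive frame of `size` samples."""
    levels = []
    for i in range(0, len(samples) - size + 1, size):
        frame = samples[i:i + size]
        levels.append(math.sqrt(sum(s * s for s in frame) / size))
    return levels

def non_silent_regions(samples, rate, frame_seconds=0.5, threshold=0.02):
    """Return (start, end) times in seconds of runs of frames louder than threshold."""
    size = int(rate * frame_seconds)
    levels = frame_rms(samples, size)
    regions = []
    start = None
    for idx, level in enumerate(levels):
        t = idx * frame_seconds
        if level > threshold and start is None:
            start = t                      # a loud run begins
        elif level <= threshold and start is not None:
            regions.append((start, t))     # the loud run ended at this frame
            start = None
    if start is not None:
        regions.append((start, len(levels) * frame_seconds))
    return regions

# Hypothetical signal: 1 s of silence, 1.5 s of a 50 Hz tone, 0.5 s of silence.
RATE = 1000
loud = [0.5 * math.sin(2 * math.pi * 50 * t / RATE) for t in range(1500)]
signal = [0.0] * 1000 + loud + [0.0] * 500
print(non_silent_regions(signal, RATE))
```

On real tape transfers the threshold would have to sit above the tape hiss, so expect to tune it per recording.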

软糯酥胸 2024-11-09 18:52:15

Two ideas:

  • Look in the "speech recognition" field, for example CMUSphinx
  • Audacity has a "Truncate silence" tool that might be useful.