C: Split a wav file at silence gaps
I have a set of wav files, each a recording of a person reading a simple sentence ("hello world").
How can I split each wav file into 2 wav files, each containing one word ("hello" and "world"), by automatically recognizing the gap between the words?
Unfortunately, I was unable to find a tool that does this for me, so I will write C code to do it.
As I understand it, the gaps should show up as low numeric values in the wav file. Is that correct?
I already know how to split the files.
I would be glad to get an approach to the gap recognition problem.
Thank you!
Comments (3)
http://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/
I am sure this is the link you need.
SoX will split audio when it detects 5 or more seconds of silence. You’ll end up with output files named out001.wav, out002.wav, and so on.
The way I approach this kind of task is by breaking the wav file into blocks of, say, 0.05 seconds each, computing the RMS amplitude of each block, and comparing that RMS amplitude to a threshold. If the recording is made under carefully controlled conditions and the volume of the speech is relatively well normalized, the threshold can be a static value. Another way is to do it dynamically, checking for a block that is substantially louder than the previous block. You then consider the over-threshold block to be the start of a word.
However, in casual speech, there may not be much of a pause between words. If I say "helloworld" to you without a pause, you can understand me easily.
RMS amplitude is defined as the square root of the average-over-time of the squares of the individual samples.
See this answer about note onset detection (detecting the start and end of musical notes in a WAV file is exactly the same problem as detecting the start and end of spoken words in a WAV file).
Please note, however, that the task you've set for yourself is essentially impossible without extremely sophisticated (and not yet in existence) artificial intelligence. When a person speaks in a recording, there usually are not gaps between individual words that are numerically any different from the gaps between individual syllables within multi-syllabic words.