Audio mining for word boundaries

Published 2024-11-03


What I plan on doing:

I want to develop an English accent (without professional training).

The set of axioms behind my reasoning, with an executive summary:

The following is knowingly oversimplified; sorry for that. I tried to keep the question short.

Part 1 : Understanding how learning works.

At the moment I assume that Broca's area and Wernicke's area must be aware of the language, and that muscle memory built on an existing phonetic alphabet will produce the speech. Accents are simply formed naturally over time through phonetic-alphabet assimilation.


Using Google I found that speech shadowing can potentially be used for phonetic-symbol assimilation. Muscle memory, on the other hand, can be trained easily by repetitive action. This is most effective if the person is 23-24 years of age and has lots of uninterrupted time on his/her hands, since losing focus can dramatically decrease the gradient of the effective learning curve. This kind of procedural memory can probably be consolidated further with a designed sleep pattern.

Part 2 : Designing behavioral pattern

  • Finding a fluent speaker whose accent I want to sound like.
  • Distinguishing the target accent's phonemes and phones.
  • Training muscle memory to produce the target accent.

Part 3 : Finding a fluent speaker whose accent I want to sound like.

YouTube is a powerful free resource. A sample audio that I thought about picking: Someone Like You - Adele (cover), in HD.

It does not bother me that it is a high-pitched female voice.

Part 4 : Distinguishing the target accent's phonemes and phones.

It is not a trivial task to identify and judge whether a spoken phone is correct, or how correctly a given text is spoken by a human. In fact it looks so complex that I won't bother automating it and will just use IPA as a baseline.

Here is the first verse of the sample audio above, with word stress in American IPA:
[IPA transcription image]

No copyright infringement intended. The image was created with upodn (alternative: photransedit).
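Since IPA is only the baseline, the lookup step itself is mechanical. Here is a minimal sketch using a tiny hand-written subset of a CMU-style (ARPAbet) pronouncing dictionary; a real pipeline would load the full CMUdict instead of this three-word toy:

```python
# Toy phonetic lookup: maps words to ARPAbet phones using a tiny
# hand-written subset of a CMU-style pronouncing dictionary.
# (A real pipeline would load the full CMUdict instead.)
MINI_DICT = {
    "someone": ["S", "AH1", "M", "W", "AH2", "N"],
    "like": ["L", "AY1", "K"],
    "you": ["Y", "UW1"],
}

def to_phones(text):
    """Return the phone sequence for each known word, None if unknown."""
    return [MINI_DICT.get(w.strip(".,!?").lower()) for w in text.split()]

print(to_phones("Someone like you"))
# → [['S', 'AH1', 'M', 'W', 'AH2', 'N'], ['L', 'AY1', 'K'], ['Y', 'UW1']]
```

The digits on the vowels are ARPAbet stress markers (1 = primary, 2 = secondary), which is exactly the word-stress information the transcription image above encodes in IPA.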

Part 5 : Training muscle memory to produce target accent.

Although it is fun just to try to mimic and achieve synchronization, I would prefer building a tool that extracts words as audio files, so I can use Winamp or an iPod to loop and shuffle the words I want.

I imagine that I can use MS Expression Encoder for this.

Question

Given an audio file (e.g. in WAV format, size < 32 MB) and its text equivalent (a finite number of words, e.g. 2000), how can it be split into multiple files, each containing one word? A word may include some excess silence, and the boundary checks can be user-approved. If that cannot be done accurately, what is the best way to get a good estimate of the word boundaries?

The main intention is to reduce the work I would be doing if this were done manually.
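As a first pass at the question above, a crude silence-based segmenter already gets most of the way for cleanly spoken audio (transcript-aware forced alignment is the proper solution, but much heavier). The sketch below is illustrative only: it assumes mono PCM samples as Python floats in [-1, 1] (real audio would come from the stdlib `wave` module), and the threshold values are guesses to be tuned per recording:

```python
import math

# Minimal sketch of silence-based word boundary estimation.
# Assumes mono PCM samples as floats in [-1, 1]; thresholds are illustrative.

def segment_words(samples, rate, frame_ms=20, threshold=0.02, min_gap_ms=120):
    """Return (start, end) sample indices of non-silent regions."""
    frame = max(1, rate * frame_ms // 1000)
    min_gap = max(1, min_gap_ms // frame_ms)   # silent frames needed to split
    # Per-frame RMS energy.
    energies = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        energies.append((sum(s * s for s in chunk) / len(chunk)) ** 0.5)
    loud = [e > threshold for e in energies]
    segments, start, silent_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:
                segments.append((start * frame, (i - silent_run + 1) * frame))
                start, silent_run = None, 0
    if start is not None:
        segments.append((start * frame, len(samples)))
    return segments

# Synthetic demo: two "words" (bursts of a 440 Hz tone) separated by silence.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate // 2)]
audio = tone + [0.0] * (rate // 2) + tone
print(segment_words(audio, rate))   # → [(0, 4000), (8000, 12000)]
```

Each (start, end) pair can then be written out as its own WAV file, and the user-approval step from the question is simply a review of these boundaries against the word list. Sung audio like the Adele sample will defeat this (words run together), which is where the answers below come in.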


Comments (2)

野侃 2024-11-10 07:22:09


Detecting word boundaries is an intensely complex task! I don't know if you've looked into this more, but see Saffran et al. (1996), Word Segmentation: The Role of Distributional Cues.
There are also many, many corpora of language production out there for many languages, so rather than using a new speaker, I'd look into what's already been done in the linguistics literature on detecting word boundaries.

倾其所爱 2024-11-10 07:22:09


First of all I would convert the signal from the time domain into the frequency domain by running an FFT over it. That might allow you to match certain consonant sounds in your text to broadband noise in the FFT. The point here is that you're not trying to do full speech recognition, just to find the best match of signal to text. (I did something similar for document-image highlighting back when I was at uni; I didn't need to resort to OCR because I already had the text.) My guess is that looking for dips in amplitude won't help you much, because some words run into each other.

Here's how I'd approach it for a first attempt:

  1. Analyze the text/IPA for words that start with consonants that result in an easily-identifiable pattern in the frequency spectrum.
  2. Starting with a high threshold, detect instances of the pattern.
  3. Lower the threshold until you get the right number of instances and the relative distances between them match your estimate of the distance from the text.
  4. (if possible, get user verification of split points here)
  5. This should give you a set of hopefully short phrases and blocks of spectrum.
  6. Split these blocks into words by using another feature detection method.
  7. Continue until you have only single words.

I'm sure it could be generalized, but that's how I'd attempt it.
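Steps 1-2 of this outline can be sketched as a crude high-frequency-energy detector. This is only an illustration of the idea, not the answerer's actual implementation: it uses a naive DFT instead of a real FFT library to stay dependency-free, flags frames whose energy lies mostly above ~1 kHz, and substitutes a deterministic 3 kHz tone for genuine fricative broadband noise:

```python
import cmath
import math

def dft_magnitudes(frame):
    """Naive O(n^2) DFT magnitude spectrum; fine for short demo frames."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def broadband_frames(samples, rate, frame_len=128, hf_ratio=0.5):
    """Flag frames whose spectral energy lies mostly above ~rate/8 Hz,
    a crude stand-in for the broadband noise of fricatives like /s/."""
    flags = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        mags = dft_magnitudes(samples[i:i + frame_len])
        total = sum(m * m for m in mags) or 1e-12
        high = sum(m * m for m in mags[len(mags) // 4:])  # bins >= rate/8
        flags.append(high / total > hf_ratio)
    return flags

# Demo: two frames of a 440 Hz tone (voiced-like), then two frames of a
# 3000 Hz tone standing in for high-frequency consonant energy.
rate = 8000
low = [math.sin(2 * math.pi * 440 * t / rate) for t in range(256)]
high = [math.sin(2 * math.pi * 3000 * t / rate) for t in range(256)]
print(broadband_frames(low + high, rate))   # → [False, False, True, True]
```

The positions of the flagged frames would then be matched against the positions of consonant-initial words predicted from the text/IPA, lowering `hf_ratio` until the counts line up, as in steps 2-3 above.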
