Word/phoneme corpus (English) for an Elman SRN
I'm writing an Elman Simple Recurrent Network. I want to give it sequences of words, where each word is a sequence of phonemes, and I want a lot of training and test data.
So, what I need is a corpus of English words, together with the phonemes they're made up of, written as something like ARPAbet or SAMPA. British English would be nice but is not essential so long as I know what I'm dealing with. Any suggestions?
I do not currently have the time or the inclination to code something that derives a word's phonemes from spoken or written data, so please don't propose that.
Note: I'm aware of the CMU Pronouncing Dictionary, but it claims only to be based on the ARPAbet symbol set. Does anyone know whether there are actually any differences and, if so, what they are? (If there aren't any, then I could just use that...)
EDIT: According to the CMUPD 0.7a symbol list, vowels may carry lexical stress, and there are variants of the standard ARPAbet vowel symbols that indicate this.
Comments (1)
CMUdict should be fine. "Arpabet symbol set" just means Arpabet. If there are any minor differences, they should be explained in the CMUdict documentation.
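For what it's worth, here is a minimal sketch of turning CMUdict-style lines into phoneme sequences. It assumes the plain-text layout of a word followed by whitespace-separated symbols, `;;;` comment lines, alternate pronunciations marked as `WORD(1)`, and stress shown as a trailing digit on vowels. The function name `parse_cmudict` and the `keep_stress` flag are made up for this example, and the inline sample stands in for the real file.

```python
import re

# A few lines in the dictionary's format, used here instead of the real file.
SAMPLE = """\
;;; sample lines in CMUdict format
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

def parse_cmudict(text, keep_stress=False):
    """Map each word to a list of pronunciations (each a list of symbols)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and ';;;' comment lines.
        if not line or line.startswith(";;;"):
            continue
        head, *symbols = line.split()
        # Alternate pronunciations look like WORD(1); fold them into WORD.
        word = re.sub(r"\(\d+\)$", "", head)
        if not keep_stress:
            # Drop the trailing stress digit (0, 1 or 2) from vowel symbols.
            symbols = [s.rstrip("012") for s in symbols]
        entries.setdefault(word, []).append(symbols)
    return entries

lexicon = parse_cmudict(SAMPLE)
print(lexicon["HELLO"])
```

With `keep_stress=False` the symbols reduce to the bare phoneme labels; with `keep_stress=True` each stressed vowel variant stays a distinct symbol, which enlarges the input alphabet for the network.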
If you need data that's closer to real life than stringing together dictionary pronunciations of individual words, look for phonetically transcribed corpora, e.g., TIMIT.
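If you go the TIMIT route, the phonetic transcriptions come as `.phn` files with one segment per line: start sample, end sample, phone label. A rough sketch of reading one follows; `parse_phn` is a hypothetical helper, the sample text is only illustrative, and dropping the `h#` boundary/silence label is an assumption about what you would want as network input. Note that TIMIT's phone labels are lowercase and not identical to CMUdict's symbol set, so check the corpus documentation before mixing the two.

```python
# Illustrative lines in the .phn layout: "<start_sample> <end_sample> <label>"
SAMPLE_PHN = """\
0 3050 h#
3050 4559 sh
4559 5723 ix
5723 6642 hv
"""

def parse_phn(text):
    """Return a list of (start_sample, end_sample, label) tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        start, end, label = parts
        segments.append((int(start), int(end), label))
    return segments

segments = parse_phn(SAMPLE_PHN)
# Keep only the phone labels, skipping the utterance-boundary silence marker.
phones = [label for _, _, label in segments if label != "h#"]
print(phones)
```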