Using Markov models to convert all-caps to mixed case, and related problems
I've been thinking about using Markov techniques to restore missing information to natural language text.
- Restore all-caps text to mixed-case.
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
- Convert rough phonetic transcriptions back into native alphabets.
That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context.
I can use Wiktionary as a dictionary and Wikipedia as a corpus, with n-grams and Hidden Markov Models to resolve the ambiguities (a sketch of the n-gram counting step follows the examples below).
Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?
Examples
- GEORGE LOST HIS SIM CARD IN THE BUSH ⇨ George lost his SIM card in the bush
- tantot il rit a gorge deployee ⇨ tantôt il rit à gorge déployée
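To make the corpus idea concrete, here is a minimal sketch of counting word bigrams from a plain-text Wikipedia dump, the kind of statistics an n-gram model or HMM would be trained on. The file name is a placeholder, not a real resource:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide an n-wide window over the token list.
    return zip(*(tokens[i:] for i in range(n)))

counts = Counter()
# "wikipedia_plaintext.txt" is a placeholder for a Wikipedia dump
# already extracted to plain text, one line at a time.
with open("wikipedia_plaintext.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(ngrams(line.split(), 2))  # bigrams; n is a choice

# Frequent bigrams supply the contextual statistics used to
# disambiguate (e.g. "SIM card" vs "sim card").
print(counts.most_common(10))
```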
Answers (2)
I think you can use Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's some boost for your google-fu:
caps
This is called truecasing.
languages which should have them but have been converted to plain ASCII
I suspect Markov models are going to have a hard time on this. OTOH, labelled training data is free since you can just take a bunch of accented text in the target language and strip the accents. See also next answer.
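For example, here is a minimal sketch of that free training-data trick using only the Python standard library (the function name is just illustrative):

```python
# Generate "free" labelled training data for diacritics restoration:
# take accented text in the target language and strip the accents.
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (e.g. 'é' -> 'e' + U+0301), then drop the
    # combining marks, leaving plain base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

accented = "tantôt il rit à gorge déployée"
plain = strip_accents(accented)  # "tantot il rit a gorge deployee"
# (plain, accented) pairs can now serve as (input, label) training data.
print(plain)
```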
back into native alphabets
This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).
I'll take a crack at fleshing out how you would accomplish these.
Capitalisation
This is fairly close to Named Entity Recognition and is an example of a 'sequence tagging problem'. Proper nouns should be initially capitalised, organisation names that are acronyms should be in all caps, and there are other examples that fall outside those categories. It seems to me that it would therefore be harder than NER, so a straightforward dictionary-based approach probably wouldn't be an ideal solution.
If you were to use a Hidden Markov Model, this would amount to letting the 'hidden' state of the HMM be [lowerCase, initCaps, allCaps] and training on some data that you assume is correct (e.g. Wikipedia, but there are many other sources too). You then infer the hidden state for words that you aren't sure are correctly capitalised. There are a bunch of HMM libraries out there; I'm sure you can find one to suit your needs. I'd say trying an HMM is a good initial choice (see the sketch below).
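Here is a minimal sketch of that three-state truecasing HMM with hand-rolled Viterbi decoding. The start, transition, and emission numbers are toy values; a real system would estimate them from counts over a correctly cased corpus such as Wikipedia:

```python
import math

STATES = ["lower", "initCaps", "allCaps"]

# Toy parameters (log-probabilities); estimate these from a cased corpus.
start = {s: math.log(1.0 / 3) for s in STATES}
trans = {s: {t: math.log(1.0 / 3) for t in STATES} for s in STATES}

def emit(state, word):
    # P(word | state): a real model would use per-word corpus counts;
    # a few hard-coded toy cases stand in for that here.
    table = {
        ("initCaps", "george"): 0.9, ("lower", "george"): 0.05,
        ("allCaps", "sim"): 0.8, ("lower", "sim"): 0.1,
    }
    return math.log(table.get((state, word), 0.3))

def viterbi(words):
    """Most likely capitalisation-state sequence for lowercased words."""
    layer = {s: (start[s] + emit(s, words[0]), [s]) for s in STATES}
    for w in words[1:]:
        nxt = {}
        for s in STATES:
            score, path = max(
                (layer[p][0] + trans[p][s], layer[p][1]) for p in STATES
            )
            nxt[s] = (score + emit(s, w), path + [s])
        layer = nxt
    return max(layer.values())[1]

print(viterbi("george lost his sim card".split()))
# e.g. ['initCaps', 'lower', 'lower', 'allCaps', 'lower']
```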
Non-ASCII characters
As you guessed, a tougher problem. If you tried to do this with an HMM at the word level, you would have an enormous number of hidden states, one for each accented word, which would probably be impossible to train. The problem is more tractable at the character level but you lose a tremendous amount of context if you only consider the previous character. If you start using n-grams instead of characters, your scaling problems come back. In short, I don't think this problem is like the previous one because the number of labels is too large to consider it a sequence labelling problem (I mean you can, it's just not practical).
I haven't heard of research in this area; then again, I'm no expert. My best guess would be to use a general language model for the language you are interested in, which gives you the probability of a sentence in that language. Then you could substitute possibly accented characters, score the resulting candidate sentences, and take the most likely, or use some threshold on the difference, or something like that (a sketch follows). You could train an n-gram language model fairly easily on a large corpus of a certain language.
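A minimal sketch of that candidate-and-score idea, with a toy variant table and unigram scores standing in for a real lexicon and an n-gram language model trained on a large accented corpus:

```python
import itertools

VARIANTS = {  # plain form -> plausible accented forms (toy lexicon)
    "a": ["a", "à"],
    "tantot": ["tantot", "tantôt"],
    "deployee": ["deployee", "déployée"],
}

LM = {  # toy unigram log-probabilities; use a real n-gram LM in practice
    "tantôt": -2.0, "il": -1.0, "rit": -3.0, "à": -1.5,
    "gorge": -4.0, "déployée": -5.0,
}

def score(sentence):
    # Unigram language model with a harsh penalty for unseen words.
    return sum(LM.get(w, -20.0) for w in sentence)

def restore(words):
    """Return the highest-scoring accented candidate sentence."""
    options = [VARIANTS.get(w, [w]) for w in words]
    return max(itertools.product(*options), key=score)

print(" ".join(restore("tantot il rit a gorge deployee".split())))
# -> tantôt il rit à gorge déployée
```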
I have no idea if this would actually work, either in terms of accuracy or efficiency. I don't have direct experience of this particular problem.
Transliteration
No idea, to be honest. I don't know where you would find data to make a system of your own. After a brief search, I found the Google Transliteration service (with API). Perhaps it does what you're after. I don't even have enough experience in languages with other scripts to really know what it's doing.