Translating single words in context using computational linguistics tools
I would like to automatically annotate texts for learners of foreign languages with translations of difficult words.
For instance, if the original text is:
El gato está en la casa de mis vecinos
it becomes:
El gato está en la casa de mis vecinos (neighbours)
The first step is to identify which words are the difficult ones. This could be done by lemmatizing the words in the original text and comparing the lemmas with a list of 'easy words' (a basic vocabulary of 1500-2000 words). Any word not found in this list is designated a 'hard word'. This process seems straightforward enough using the Natural Language Toolkit (NLTK) for Python.
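As a rough illustration of this first step, here is a minimal sketch using only the standard library. The easy-word list, the toy lemma table, and the function names `lemmatize` and `find_hard_words` are all made up for this example. In practice the stand-in `lemmatize` would be replaced by a real lemmatizer for the source language, and the easy list would hold the full basic vocabulary.

```python
import re

# Hypothetical toy data: a real application would load a 1500-2000 word
# easy list and use a proper lemmatizer for the source language.
EASY_LEMMAS = {"el", "gato", "estar", "en", "la", "casa", "de", "mi"}
TOY_LEMMAS = {"está": "estar", "mis": "mi", "vecinos": "vecino"}


def lemmatize(token):
    """Stand-in lemmatizer: lowercase the token, then look it up in a toy table."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)


def find_hard_words(text, easy_lemmas=EASY_LEMMAS):
    """Return (token, lemma) pairs whose lemma is not in the easy list."""
    tokens = re.findall(r"\w+", text)
    hard = []
    for token in tokens:
        lemma = lemmatize(token)
        if lemma not in easy_lemmas:
            hard.append((token, lemma))
    return hard


print(find_hard_words("El gato está en la casa de mis vecinos"))
# → [('vecinos', 'vecino')]
```

Comparing lemmas rather than surface forms means the easy list does not need to contain every inflected form of each word.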
Words that must be translated as a unit cause some difficulty, such as 'newlyweds', phrasal verbs like 'he called me up', or the German 'er ruft mich an' (anrufen). Here the words can't be treated individually. For phrasal verbs and the like, some understanding of grammar is probably needed.
The second step involves obtaining a correct translation of the difficult words according to the context in which they appear. As I understand it, this is effectively applying the first half of a statistical machine translation system like Google Translate. I believe this problem could be solved using the Google Translate Research API, which lets you send text to be translated; the response includes information about which word in the translation corresponds to which word in the original text. So you could feed in the whole sentence and then fish out the word you wanted from the response. However, you have to apply to use this API, and it has usage limits, which would likely be a problem for my application. I would rather find another solution. I expect no solution will give 100% correct translations, so they will have to be checked by hand, but this should still speed things up.
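To make the "fish out the word you wanted" idea concrete, here is a minimal sketch of what the annotation could look like once a sentence translation and a word alignment are available. The alignment format (a list of source-index, target-index pairs) and the function name `annotate` are assumptions for illustration only. They are not the response format of any real translation API, and the alignment here is written by hand.

```python
def annotate(source_tokens, target_tokens, alignment, hard_indices):
    """Append the aligned target word(s) in parentheses after each hard source word.

    alignment is a list of (source_index, target_index) pairs. This shape is
    an assumption for illustration, not the format of any real API response.
    """
    # Group target indices by the source index they are aligned to.
    aligned = {}
    for src_i, tgt_i in alignment:
        aligned.setdefault(src_i, []).append(tgt_i)

    out = []
    for i, token in enumerate(source_tokens):
        out.append(token)
        if i in hard_indices and i in aligned:
            gloss = " ".join(target_tokens[j] for j in sorted(aligned[i]))
            out.append(f"({gloss})")
    return " ".join(out)


source = "El gato está en la casa de mis vecinos".split()
target = "The cat is in the house of my neighbours".split()
# Hand-written one-to-one alignment for this toy sentence.
alignment = [(i, i) for i in range(len(source))]

print(annotate(source, target, alignment, hard_indices={8}))
# → El gato está en la casa de mis vecinos (neighbours)
```

Because a source word may align to several target words, the sketch joins all aligned target tokens, which is one simple way to handle cases where one word needs a multi-word gloss.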
Thanks for your comments.
David
Comments (1)
For the initial step, there is no need to rely on an a priori vocabulary list. It should suffice to accumulate token counts over a training corpus, rank the vocabulary by frequency, and mark the tokens in your test set that do not occur before a cutoff point in the rank-ordered vocabulary.
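A minimal sketch of this frequency-cutoff approach, using only the standard library, might look like the following. The three-sentence training corpus, the cutoff of 8, and the function names are invented for illustration. A real corpus would be far larger and the cutoff would be in the range of a basic vocabulary.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_easy_vocab(training_texts, cutoff):
    """Return the `cutoff` most frequent tokens in the training corpus."""
    counts = Counter()
    for text in training_texts:
        counts.update(tokenize(text))
    return {word for word, _ in counts.most_common(cutoff)}


def mark_hard_tokens(text, easy_vocab):
    """Return the tokens of `text` that fall outside the easy vocabulary."""
    return [tok for tok in tokenize(text) if tok not in easy_vocab]


# Toy training corpus (an assumption for this example).
training = [
    "el gato está en la casa de mis padres",
    "el perro está en la casa de mis amigos",
    "el gato de mis vecinos come en la casa",
]
easy = build_easy_vocab(training, cutoff=8)

print(mark_hard_tokens("El gato está en la casa de mis vecinos", easy))
# → ['vecinos']
```

Note that this version counts surface tokens rather than lemmas, so inflected forms are ranked separately. Counting lemmas instead would combine it with the lemmatization idea from the question.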
http://vuw.academia.edu/JosephSorell/Papers/549885/Zipfs_Law_and_Vocabulary
For the second step, "obtaining a correct translation of the difficult words according to context in which they appear", yes, you would need access to an MT API and/or human translation. Choosing the best approach depends on your goals.
You can have a correct translation, a fast translation, or a cheap translation - I know of no way you can have all three simultaneously.