Translating individual words in context using computational linguistics tools

Posted on 2024-10-26 19:03:05

I would like to automatically annotate texts for learners of foreign languages with translations of difficult words.

For instance, if the original text is:

El gato está en la casa de mis vecinos

it becomes:

El gato está en la casa de mis vecinos (neighbours)

The first step is to identify which words are difficult. This could be done by lemmatizing the words in the original text and comparing them with a list of 'easy words' (a basic vocabulary of 1500-2000 words). Words not found in this list would be designated 'hard words'. This process seems straightforward enough using the Natural Language Toolkit (NLTK) for Python.
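To make this first step concrete, here is a minimal sketch of the lemmatize-and-compare idea. It deliberately avoids any real lemmatizer: `EASY_LEMMAS` and `LEMMA_TABLE` are tiny hypothetical stand-ins for a real 1500-2000 word list and a real Spanish lemmatizer, and the function names are made up for illustration.

```python
import re

# Hypothetical toy data: a real system would load a 1500-2000 lemma list.
EASY_LEMMAS = {"el", "gato", "estar", "en", "casa", "de", "mi"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMA_TABLE = {
    "la": "el",
    "está": "estar",
    "mis": "mi",
    "vecinos": "vecino",
}


def lemmatize(token):
    """Return the lemma for a token, falling back to the lowercased token."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def find_hard_words(text, easy_lemmas=EASY_LEMMAS):
    """Return the tokens whose lemma is not in the easy-word list."""
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if lemmatize(tok) not in easy_lemmas]


hard_words = find_hard_words("El gato está en la casa de mis vecinos")
print(hard_words)
```

With this toy data only `vecinos` is flagged, because its lemma `vecino` is missing from the easy list. Swapping in a real lemmatizer should only require replacing `lemmatize`.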

Some words must be translated as a unit, such as 'newly weds', phrasal verbs like 'he called me up', or the German 'er ruft mich an' (anrufen). Here the words can't be treated individually. For phrasal verbs and the like, perhaps some understanding of grammar is needed.
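One rough way to catch such split expressions without a full parser is a lookup table of verb and particle pairs, searched within a small window. The sketch below is only an illustration under that assumption: the `SEPARABLE` table, the `max_gap` parameter and the function name are hypothetical, and a table of inflected forms like this would not scale without lemmatization or a real grammar.

```python
# Hypothetical table mapping (verb form, particle) to the unit to translate.
SEPARABLE = {
    ("ruft", "an"): "anrufen",
    ("called", "up"): "call up",
}


def find_discontinuous(tokens, table=SEPARABLE, max_gap=3):
    """Find verb/particle pairs separated by at most max_gap tokens.

    Returns a list of (verb_index, particle_index, lemma) tuples.
    """
    lowered = [tok.lower() for tok in tokens]
    matches = []
    for i, head in enumerate(lowered):
        for (verb, particle), lemma in table.items():
            if head != verb:
                continue
            # Tokens following the verb, allowing up to max_gap words between.
            window = lowered[i + 1 : i + 2 + max_gap]
            if particle in window:
                j = i + 1 + window.index(particle)
                matches.append((i, j, lemma))
    return matches


print(find_discontinuous("er ruft mich an".split()))
print(find_discontinuous("he called me up".split()))
```

Both halves of a match can then be treated as a single item to translate, rather than as two unrelated words.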

The second step involves obtaining a correct translation of the difficult words according to the context in which they appear. As I understand it, this is effectively applying the first half of a statistical machine translation system like Google Translate. I believe this problem could be solved using the Google Translate Research API, which lets you send text to be translated; the response includes information about which word in the translation corresponds to which word in the original text. So you could feed in the whole sentence and then fish out the word you wanted from the response. However, you have to apply to use this API, and it has usage limits, which would likely be a problem for my application. I would rather find another solution. I expect no solution will give 100% correct translations and that they will have to be checked by hand, but this should still speed things up.
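The "fish out the word" idea can be sketched independently of any particular service. The code below assumes the word alignment arrives as a list of `(source_index, target_index)` pairs; that format, the sample alignment and the function names are assumptions for illustration, not the actual response format of any Google API.

```python
def gloss_from_alignment(tgt_tokens, alignment, hard_indices):
    """Collect the target words aligned to each hard source word.

    alignment is a list of (source_index, target_index) pairs.
    Returns a dict mapping source index to its gloss string.
    """
    collected = {}
    for src_i, tgt_i in sorted(alignment, key=lambda pair: pair[1]):
        if src_i in hard_indices:
            collected.setdefault(src_i, []).append(tgt_tokens[tgt_i])
    return {i: " ".join(words) for i, words in collected.items()}


def annotate(src_tokens, glosses):
    """Append each gloss in parentheses after its source word."""
    return " ".join(
        f"{tok} ({glosses[i]})" if i in glosses else tok
        for i, tok in enumerate(src_tokens)
    )


src = "El gato está en la casa de mis vecinos".split()
tgt = "The cat is in the house of my neighbours".split()
# Hypothetical alignment; here every word happens to align one-to-one.
alignment = [(i, i) for i in range(len(src))]

glosses = gloss_from_alignment(tgt, alignment, hard_indices={8})
print(annotate(src, glosses))
```

Keying the glosses by token position rather than by the word itself keeps repeated words apart, so the same word can receive different translations in different places.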

Thanks for your comments.

David

回忆凄美了谁 2024-11-02 19:03:05

For the initial step, there is no need to rely on an a priori vocabulary list. Simply accumulating token counts in a training corpus, and marking the tokens in your test set that do not occur before a cutoff point in the rank-ordered vocabulary, should suffice.

http://vuw.academia.edu/JosephSorell/Papers/549885/Zipfs_Law_and_Vocabulary
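A minimal sketch of that frequency-cutoff approach, using only the standard library. The three-sentence corpus, the cutoff value and the function names are placeholders for illustration; a usable ranking would need a far larger training corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_ranked_vocab(corpus_texts):
    """Return the corpus vocabulary ordered from most to least frequent."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common()]


def mark_hard(text, ranked_vocab, cutoff):
    """Return tokens that do not appear among the top `cutoff` ranked words."""
    easy = set(ranked_vocab[:cutoff])
    return [tok for tok in tokenize(text) if tok not in easy]


# Placeholder training corpus.
corpus = [
    "el gato está en la casa",
    "la casa de mis amigos",
    "el gato de la casa",
]
ranked = build_ranked_vocab(corpus)
print(mark_hard("El gato está en la casa de mis vecinos", ranked, cutoff=5))
```

Note that this version counts surface tokens rather than lemmas, so inflected forms of a common word may still be flagged unless the corpus is lemmatized first.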

For the second step, "obtaining a correct translation of the difficult words according to the context in which they appear", yes, you would need access to an MT API and/or human translation. Choosing the best approach depends on your goals.

You can have a correct translation, a fast translation, or a cheap translation - I know of no way you can have all three simultaneously.
