What makes a good machine translation metric or gold standard set?
I'm starting to look into doing some machine translation of search queries, and have been trying to think of different ways to rate my translation system between iterations and against other systems. The first thing that comes to mind is getting translations of a set of search terms from a bunch of people on MTurk and checking whether each one is valid, or something along those lines, but that would be expensive and possibly prone to people submitting bad translations.
Now that I'm trying to come up with something cheaper or better, I figured I'd ask StackOverflow for ideas, in case there's already some standard available, or someone has tried to find one before. Does anyone know, for example, how Google Translate rates the various iterations of its system?
3 Answers
There is some information here that might be useful, as it provides a basic explanation of the BLEU scoring technique, which developers often use to measure the quality of an MT system.
The first link provides a basic overview of BLEU, and the second points out some of BLEU's limitations.
http://kv-emptypages.blogspot.com/2010/03/need-for-automated-quality-measurement.html
and
http://kv-emptypages.blogspot.com/2010/03/problems-with-bleu-and-new-translation.html
There is also some very specific, pragmatic advice on how to develop a useful test set on the AsiaOnline.Net site in the November newsletter. I am unable to include that link because of the two-link limit.
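To make the BLEU idea concrete, here is a minimal sketch using NLTK's `sentence_bleu` and `corpus_bleu`; the tokenized reference and hypothesis sentences are invented for illustration, and the smoothing function is just one reasonable choice for short, query-like inputs.

```python
# Minimal BLEU sketch with NLTK (pip install nltk).
# The reference/hypothesis sentences below are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# One or more human reference translations per hypothesis, tokenized.
references = [["the", "cat", "is", "on", "the", "mat"],
              ["there", "is", "a", "cat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]

smooth = SmoothingFunction().method1  # avoids zero scores on very short inputs

# Sentence-level BLEU for a single translation.
print("sentence BLEU:", sentence_bleu(references, hypothesis, smoothing_function=smooth))

# Corpus-level BLEU over a whole test set:
# a list of reference lists and a parallel list of hypotheses.
print("corpus BLEU:  ", corpus_bleu([references], [hypothesis], smoothing_function=smooth))
```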
I'd suggest refining your question. There are a great many metrics for machine translation, and it depends on what you're trying to do. In your case, I believe the problem is simply stated as: "Given a set of queries in language L1, how can I measure the quality of the translations into L2, in a web search context?"
This is basically cross-language information retrieval.
What's important to realize here is that you don't actually care about providing the user with the translation of the query: you want to get them the results that they would have gotten from a good translation of the query.
To that end, you can simply measure the discrepancy of the results lists between a gold translation and the result of your system. There are many metrics for rank correlation, set overlap, etc., that you can use. The point is that you need not judge each and every translation, but just evaluate whether the automatic translation gives you the same results as a human translation.
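A minimal sketch of that comparison, assuming you already have the top-k result lists (document IDs) for the gold translation and the system translation: it reports plain set overlap (Jaccard) plus Kendall's tau over the documents both lists share. The result lists here are hypothetical placeholders for whatever your search backend returns, and scipy is only needed for the rank correlation.

```python
# Comparing the result list of a gold query translation against the system's.
from scipy.stats import kendalltau

def jaccard(a, b):
    """Set overlap between two result lists, ignoring rank."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def rank_agreement(a, b):
    """Kendall's tau over the documents that appear in both lists."""
    shared = [d for d in a if d in b]
    if len(shared) < 2:
        return None  # not enough overlap to compare rankings
    ranks_a = [a.index(d) for d in shared]
    ranks_b = [b.index(d) for d in shared]
    tau, _ = kendalltau(ranks_a, ranks_b)
    return tau

# Hypothetical top-5 result lists (document IDs) for the two translations.
gold_results   = ["doc1", "doc2", "doc3", "doc4", "doc5"]
system_results = ["doc2", "doc1", "doc3", "doc7", "doc8"]

print("set overlap:   ", jaccard(gold_results, system_results))
print("rank agreement:", rank_agreement(gold_results, system_results))
```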
As for people proposing bad translations, you can assess whether the putative gold-standard candidates have similar results lists (i.e., given 3 manual translations, do they agree in results? If not, use the 2 that overlap most). If they do agree, then they are effectively synonyms from the IR perspective.
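And a sketch of that agreement check among the putative gold candidates: fetch a result list for each manual translation (hard-coded placeholders here), then keep the pair that overlaps most. The translation labels and document IDs are hypothetical.

```python
# Filtering noisy crowd translations by result-list agreement:
# keep the pair of candidate "gold" translations whose search results overlap most.
from itertools import combinations

def jaccard(a, b):
    """Set overlap between two result lists, ignoring rank."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def best_agreeing_pair(candidate_results):
    """candidate_results maps each candidate translation to its result list."""
    best_pair, best_overlap = None, -1.0
    for (t1, r1), (t2, r2) in combinations(candidate_results.items(), 2):
        overlap = jaccard(r1, r2)
        if overlap > best_overlap:
            best_pair, best_overlap = (t1, t2), overlap
    return best_pair, best_overlap

# Hypothetical result lists for three crowd-sourced translations of one query.
candidates = {
    "translation A": ["doc1", "doc2", "doc3"],
    "translation B": ["doc2", "doc1", "doc4"],
    "translation C": ["doc9", "doc8", "doc7"],
}
print(best_agreeing_pair(candidates))
```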
In our MT evaluation we use the hLEPOR score (see the slides for details).