Designing an international translation/language adapter for global applications
I'll be implementing this for Node.js (server-side JavaScript), but this question is about the general approach to solving this problem.
There are many platforms that support translation for international applications.
For example, Zend's Translation Adapter works like this:
printf($translate->_("Today is the %1\$s") . "\n", date("d.m.Y"));
Android's system uses a strings.xml file for every language and works with the same concept as Zend's.
These work for most Western languages. However, many non-Western languages require a different word order, or are even read right-to-left instead of left-to-right.
Thus, the specified order defined in the above translate call may be invalid for a "foreign" language.
This brings me to my question: how does one design a translation system/adapter that is appropriate for any language?
Comments (2)
It is actually very hard to answer this question directly. There are a lot of use cases here. If I were to design such a system, I would keep these things in mind:
1. Sentences might need to be reordered after translation (you already brought this up). That is why we use numbered placeholders like {1}, {2} and some means of formatting the message.
2. Quite a few languages have more than one plural form. That is, if a message contains a number, it is translated differently depending on the quantity. For example:
English: 1 virus has been found | 2 viruses have been found | 5 viruses have been found
Polish: Znaleziono 1 wirusa | Znaleziono 2 wirusy | Znaleziono 5 wirusów
That is not easy to handle, but I really like the way GetText does this (an expression decides which form to use, and multiple forms are supported).
3. Users of such a library might want named placeholders (see previous questions in the I18n tags), such as "This is a message for ${name} in ${location}", used for example like this:
var formatted = 'This is a message for ${name} in ${location}'.format('location=Warsaw', 'name=Paweł');
While this poses some i18n issues, I am pretty sure it could be done in JavaScript (although the way you pass named parameters (aka arguments) might need to be different).
4. Java tends to format numbers as well as dates for a specific locale in the MessageFormat.format() method. This is not the ideal behavior, and it poses a few problems, especially in JavaScript. The first thing you need to know is the current user's locale. Once you know it, is the rest easy? Well, no. There are quite a few possible date formats; Java enumerates them as full, long, medium, short and default. Unfortunately, there is no distinction during formatting: AFAIR short would always be used. Of course, one could pass a custom format to the placeholder, something like this (AFAIR): {0,date,yyyy-MM-dd}. This poses another issue: translators would always have to provide the format, which is error-prone. Instead, I would format with the default pattern (if no additional info is given) and allow passing pattern names: {0,date,long}.
For numbers, it could be anything: a currency, a percentage or a simple numeric value. You would also need to support that distinction, for example: {0,currency,symbol:$,long}, {0,percentage}, {0,number,long}. It is not easy to guess what I mean, but for large numbers you might want to use grouping separators (1,000,000.00$), let's call it the long format, whereas sometimes you would like to print a number like this: 1234. Not an easy task.
5. .NET has the concept of a User Interface Culture (CurrentUICulture) and a Formatting Culture (CurrentCulture). The first determines the appropriate language for user interface messages, whereas the second is used for formatting (numbers, dates, currencies, etc.).
6. Different languages tend to use different collation orders; heck, even the same language could use two (or more) different ones. I am not sure if it fits the scope, but it is at least good to be aware of.
7. Support for different character encodings might be required (and probably will be). However, you might want to limit the encoding of resource files to, say, UTF-8. It won't cover all possible characters (see GB18030 for example), but it is close.
... ?
Well, I am sure I forgot something major, as the task you are approaching is monumental. And I don't know much about Node.js (as in what is currently supported).
Edit
8. Of course I forgot to mention that as software evolves, only a few user interface messages change, so old translations need to be merged into the new version (this is called leveraging in L10n terms). Usually some kind of Translation Memory software is used (for example POEdit, the GetText file format editor, has such features built in). TM software usually supports only certain file formats, so it would be a good idea to stick with an existing format rather than creating your own. This could mean dropping some features off the list...
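To make point 2 (plural forms) a bit more concrete, here is a minimal sketch of GetText-style plural selection in JavaScript. The `pluralRules` table, the `catalog` layout and the `translatePlural` function are all made up for illustration and are not part of any library; the Polish rule mirrors the plural expression commonly found in GetText catalogs.

```javascript
// GetText-style plural selection: each locale supplies a function that maps
// a count to the index of the plural form to use.
const pluralRules = {
  // English: two forms (singular, plural).
  en: (n) => (n === 1 ? 0 : 1),
  // Polish: three forms, following the rule commonly used in GetText catalogs.
  pl: (n) => {
    if (n === 1) return 0;
    const mod10 = n % 10;
    const mod100 = n % 100;
    return mod10 >= 2 && mod10 <= 4 && (mod100 < 10 || mod100 >= 20) ? 1 : 2;
  },
};

// Hypothetical catalog: one array of plural forms per locale and message key.
const catalog = {
  en: {
    virusesFound: ['{0} virus has been found', '{0} viruses have been found'],
  },
  pl: {
    virusesFound: [
      'Znaleziono {0} wirusa',
      'Znaleziono {0} wirusy',
      'Znaleziono {0} wirusów',
    ],
  },
};

// Pick the plural form for `count` and substitute the number into it.
function translatePlural(locale, key, count) {
  const forms = catalog[locale][key];
  const form = forms[pluralRules[locale](count)];
  return form.replace('{0}', String(count));
}

console.log(translatePlural('en', 'virusesFound', 1)); // 1 virus has been found
console.log(translatePlural('pl', 'virusesFound', 2)); // Znaleziono 2 wirusy
console.log(translatePlural('pl', 'virusesFound', 5)); // Znaleziono 5 wirusów
```

Keeping the rule as a per-locale function means translators only supply the forms, while the selection logic lives in one place per language.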
Your design should allow for...
Reordering of parameters
As you have identified, translators may need to reorder parameters to suit different grammars. So, whatever system you use, you need to either make the parameters named or give them an index.
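As a rough sketch of what that could look like in JavaScript: the `formatMessage` helper below is hypothetical, and the reordered string is only an English stand-in for a real translation. Because substitution is driven by the placeholder name or index rather than by position, the translator is free to move placeholders around.

```javascript
// Replace {name} or {0}-style placeholders with values looked up in `params`.
// `params` may be a plain object (named placeholders) or an array (indexed).
// Placeholders with no matching value are left untouched.
function formatMessage(template, params) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    Object.prototype.hasOwnProperty.call(params, key)
      ? String(params[key])
      : match
  );
}

const params = { name: 'Paweł', location: 'Warsaw' };

// Source string and a stand-in "translation" that reorders the placeholders.
console.log(formatMessage('This is a message for {name} in {location}', params));
// This is a message for Paweł in Warsaw
console.log(formatMessage('In {location}, this is a message for {name}', params));
// In Warsaw, this is a message for Paweł

// Indexed placeholders work the same way with an array.
console.log(formatMessage('{1} comes before {0}', ['a', 'b']));
// b comes before a
```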
Formatters
I guess you could leave these to the developer to transform before substituting them, but somewhere people are going to want to do locale-sensitive formatting of numbers, currencies, dates and times. You may want to stretch that to pluralization, but that's a can of worms you may not want to open.
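In Node.js, one possible starting point is the built-in Intl API. The sketch below wraps Intl.NumberFormat and Intl.DateTimeFormat in a hypothetical `createFormatters` helper (the helper name and its shape are assumptions for illustration). The exact output for a given locale depends on the ICU data bundled with your Node.js build, so only en-US results are shown.

```javascript
// Hypothetical helper: build a set of formatters for one formatting locale.
function createFormatters(locale) {
  return {
    number: (value) => new Intl.NumberFormat(locale).format(value),
    currency: (value, currency) =>
      new Intl.NumberFormat(locale, { style: 'currency', currency }).format(value),
    percent: (value) =>
      new Intl.NumberFormat(locale, { style: 'percent' }).format(value),
    // `style` is one of 'full', 'long', 'medium', 'short'.
    // The time zone is pinned to UTC here only to keep the example reproducible.
    date: (value, style = 'medium') =>
      new Intl.DateTimeFormat(locale, { dateStyle: style, timeZone: 'UTC' }).format(value),
  };
}

const us = createFormatters('en-US');
console.log(us.number(1234567.891));     // 1,234,567.891
console.log(us.currency(1234.5, 'USD')); // $1,234.50
console.log(us.percent(0.25));           // 25%
console.log(us.date(new Date(Date.UTC(2024, 0, 15)), 'long')); // January 15, 2024
```

Passing a style name such as 'long' instead of a raw pattern keeps the locale-specific details out of the translated strings.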
Unique keys
The lookup keys need to be unique. Using the untranslated string as a key is risky as translations of identical source strings may differ depending on their context.
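One way to keep keys unique is to combine the source text with a context, similar in spirit to message contexts in GetText. The following is only a sketch: the `context|text` key format, the context names and the `translate` function are invented for illustration.

```javascript
// Hypothetical German catalog keyed by "context|source text", so the same
// source string can be translated differently depending on where it is used.
const catalogDe = new Map([
  ['menu.file|Open', 'Öffnen'],     // verb: the action of opening a file
  ['ticket.status|Open', 'Offen'],  // adjective: the ticket is still open
]);

// Look up a translation by context and source text; fall back to the
// untranslated source text when no entry exists.
function translate(catalog, context, sourceText) {
  const key = `${context}|${sourceText}`;
  return catalog.has(key) ? catalog.get(key) : sourceText;
}

console.log(translate(catalogDe, 'menu.file', 'Open'));     // Öffnen
console.log(translate(catalogDe, 'ticket.status', 'Open')); // Offen
console.log(translate(catalogDe, 'dialog', 'Cancel'));      // Cancel
```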
Tools
Letting translators loose with "plain text files" is likely to cause trouble. You'll ideally want some mechanism to handle encodings, add translation comments from specialists, recover translations between versions and validate resultant strings to ensure the substitution parameters match the source strings.
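The validation step in particular is easy to prototype. Below is a minimal sketch that compares the placeholders in a source string with those in its translation; `extractPlaceholders` and `validateTranslation` are hypothetical names, and the check only understands simple {name} or {0} placeholders, not richer forms such as {0,date,long}.

```javascript
// Collect the distinct placeholder names used in a message, sorted.
function extractPlaceholders(message) {
  const names = (message.match(/\{(\w+)\}/g) || []).map((p) => p.slice(1, -1));
  return [...new Set(names)].sort();
}

// Report placeholders the translation dropped or introduced.
function validateTranslation(source, translation) {
  const expected = extractPlaceholders(source);
  const actual = extractPlaceholders(translation);
  return {
    missing: expected.filter((name) => !actual.includes(name)),
    unexpected: actual.filter((name) => !expected.includes(name)),
  };
}

console.log(
  validateTranslation('Hello {name}, you have {count} messages', 'Hallo {name}')
);
// { missing: [ 'count' ], unexpected: [] }
```

A check like this could run whenever translated files are imported, so broken strings are caught before they reach users.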
I'd look at the ICU, .NET and Java APIs for inspiration.