I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.
According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is very hard to do without highly advanced statistical techniques, let alone without tone marks. (The closest I have seen is Google Transliterate, which does a great job even without user input, but it is not feasible for the present project.) I am skeptical that this is even possible without falling back on the characters conventionally used to borrow foreign names, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).
Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.
While Han-Latin worked passably (about 80% accuracy on simple data), Latin-Han seemed not to work at all: it returned the same Latin string that was input. That is consistent with the results I get from the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what both use, as we can see here:
{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },
I presumed this meant that, given a pinyin string, the reverse direction could reproduce the original characters, but this does not seem to be the case.
I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?
Thank you for your time
Comments (1)
Note that the data comes from the CLDR project, http://cldr.unicode.org . ICU supports many script pairs, and it will also attempt to go through a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round-trip.