Latin<->Han conversion in ICU?

Posted on 2024-11-04 11:51:48


I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly looking at transliteration to and from Chinese.

According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is very difficult without advanced statistical techniques, and harder still without tone marks. (The closest I have seen is Google Transliterate, which does a great job even without user input, but it is infeasible for the present project.) I am skeptical that it is possible at all without resorting to the characters conventionally used to borrow foreign names, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).

Anyhow, I was willing to suspend disbelief, and after consulting documentation and tutorials, I was able to construct two Transliterator objects (to and from) and perform simple transliteration using them.

While Han-Latin worked passably (about 80% accuracy on simple data), Latin-Han did not seem to work at all: it returned the Latin input string unchanged. That is consistent with the results I get from the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what both use, as we can see here:

{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },

I presumed this meant that, given a pinyin string, the reverse transform could potentially reproduce the original Han text, but this does not seem to be the case.

I guess my general question is this: is this kind of transform even possible with ICU, or anything besides Google Transliterate? What is the expected output? Relatedly, is there a listing somewhere of the script-pairs that ICU actually supports, if this is not really possible?

Thank you for your time


Comments (1)

梦里人 2024-11-11 11:51:48

Note that the data comes from the CLDR project, http://cldr.unicode.org. ICU supports many script pairs, and it will attempt to use a pivot script (such as Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You might try browsing the ICU and CLDR data sets. The note at the top of the Han-Latin file says that it does not round trip.
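A small illustration of why a Han-to-Latin table does not round trip, using plain standard C++ rather than ICU: several characters share one pinyin syllable even with tone marks, so the inverted table has no single answer. The table below is toy data chosen for this sketch, not taken from ICU or CLDR, and all function names are hypothetical.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy forward table (Han -> pinyin with tone marks), for illustration only.
// It is NOT taken from ICU or CLDR data.
const std::vector<std::pair<std::string, std::string>>& toyForwardTable() {
    static const std::vector<std::pair<std::string, std::string>> table = {
        {"是", "shì"}, {"事", "shì"}, {"市", "shì"}, {"你", "nǐ"}, {"好", "hǎo"},
    };
    return table;
}

// Invert the forward table: each pinyin syllable maps to every Han character
// that produces it.
std::map<std::string, std::vector<std::string>> invertTable() {
    std::map<std::string, std::vector<std::string>> reverse;
    for (const auto& entry : toyForwardTable()) {
        reverse[entry.second].push_back(entry.first);
    }
    return reverse;
}

// A syllable can be mapped back mechanically only if exactly one character
// in the table produces it.
bool roundTrips(const std::string& syllable) {
    const auto reverse = invertTable();
    const auto it = reverse.find(syllable);
    return it != reverse.end() && it->second.size() == 1;
}
```

In this toy table "shì" has three candidates, so a purely rule-based reverse transform would have to either pick one arbitrarily or leave the input alone. Choosing among candidates needs context, which is what the statistical approaches mentioned in the question provide.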
