Latin<->Han conversion in ICU?

Posted on 2024-11-04 11:51:48

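I am just getting started implementing ICU transforms using ICU4C in a C++ program. I am particularly interested in transliteration to and from Chinese.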

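According to this document, the package supports both "Han-Latin" and "Latin-Han" conversion. As a student of Chinese, I find this surprising: Latin-Han conversion is particularly difficult without highly advanced statistical techniques, and even more so without tone marks. (The closest thing I have seen is Google Transliterate, which does a great job even without user input, but it is not feasible for the present project.) I am skeptical that this is even possible without resorting to the de facto characters used for borrowing foreign names, as in 比尔·莫瑞. This is the approach taken by Google Maps in their international domains, as we can see in this paper (PDF).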

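Anyhow, I was willing to suspend disbelief, and after consulting the documentation and tutorials, I was able to construct two Transliterator objects (one for each direction) and perform simple transliteration with them.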

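While Han-Latin worked passably (about 80% accuracy on simple data), Latin-Han did not seem to work at all: it returned the same Latin string that was input. This is consistent with the results I get from the online transform sample, and with what I know about Chinese. I managed to find this table, which I think is what both sources use, as we can see here: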

{ "Latin-Han", "file", "t_Hani_Latn", "REVERSE" },
{ "Han-Latin", "file", "t_Hani_Latn", "FORWARD" },

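I would presume this means that, given a pinyin string, the reverse transform could potentially reproduce the original characters, but this does not seem to be the case.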
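To illustrate why running a many-to-one table in reverse cannot reproduce the original, here is a small self-contained sketch. The table is hand-picked toy data, not ICU's or CLDR's, and the names `Reading`, `toy_table`, `to_pinyin`, and `candidates` are invented for this example. It only shows the ambiguity; it says nothing about how ICU handles the reverse rules internally.

```cpp
#include <string>
#include <vector>

// One row of a toy reading table: a character and its pinyin reading.
struct Reading {
    std::string hanzi;
    std::string toned;  // pinyin with a tone mark
    std::string plain;  // the same syllable with the tone mark dropped
};

// Hand-picked illustrative data (an assumption, not ICU or CLDR data).
const std::vector<Reading>& toy_table() {
    static const std::vector<Reading> table = {
        {"妈", "mā", "ma"},
        {"麻", "má", "ma"},
        {"马", "mǎ", "ma"},
        {"码", "mǎ", "ma"},
        {"骂", "mà", "ma"},
    };
    return table;
}

// Forward direction: each character in the table has exactly one reading.
std::string to_pinyin(const std::string& hanzi) {
    for (const Reading& r : toy_table()) {
        if (r.hanzi == hanzi) return r.toned;
    }
    return hanzi;  // unknown input passes through unchanged
}

// Reverse direction: collect every character that shares the syllable,
// matching either the toned or the toneless spelling.
std::vector<std::string> candidates(const std::string& pinyin) {
    std::vector<std::string> out;
    for (const Reading& r : toy_table()) {
        if (r.toned == pinyin || r.plain == pinyin) out.push_back(r.hanzi);
    }
    return out;
}
```

In this toy table, `candidates("mǎ")` yields two characters and `candidates("ma")` yields five, so a reverse lookup has to choose among candidates using context, which is where statistical methods come in.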

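I guess my general question is this: is this kind of transform even possible with ICU, or with anything besides Google Transliterate? What is the expected output? Relatedly, if this is not really possible, is there a list somewhere of the script pairs that ICU actually supports?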

Thank you for your time.



梦里人 2024-11-11 11:51:48

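Note that the data comes from the CLDR project, http://cldr.unicode.org . ICU supports many script pairs, and it will attempt to use a pivot script (for example Han to Latin to Russian), which is why you can create transliterators such as "Any-Latin". You can try browsing the ICU and CLDR data sets. The comments at the top of the Han-Latin file indicate that it does not round-trip.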


About the author

天气好吗我好吗
