Google AJAX Language API with Chinese

Posted on 2024-08-16 22:24:33

Does anyone know whether Chinese pinyin is supported? I'm getting results with correct Chinese pinyin here (see the "Show romanization" link).

Thanks.

你好，陌生人 2024-08-23 22:24:33

I don't know whether the Google AJAX Language API supports converting to pinyin, but if it doesn't, it actually isn't too hard to do a passable conversion on your own. (The reverse conversion, from pinyin to hanzi (characters), is much trickier, because pinyin is very lossy.)

To do the conversion yourself, grab Unihan.zip, a downloadable version of the Unihan database. The file you actually care about is Unihan_Readings.txt. It also contains a bunch of stuff you don't care about, and it's stored in a pretty inefficient way, so don't be too worried about the large file size. You should extract the stuff you care about and store it in a more efficient way.
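The "extract and store" step could look roughly like the sketch below. It assumes each data line has the form code point, TAB, field name, TAB, value, and that lines starting with "#" are comments. The function names extract_field and write_compact are made up for illustration, and the in-memory sample stands in for the real file.

```python
import io


def extract_field(lines, field="kMandarin"):
    """Yield (character, value) pairs for a single Unihan field."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3 or parts[1] != field:
            continue
        codepoint, _, value = parts
        # "U+597D" -> the character with code point 0x597D
        yield chr(int(codepoint[2:], 16)), value


def write_compact(lines, out, field="kMandarin"):
    """Write one 'character<TAB>value' line per match; return the count."""
    count = 0
    for char, value in extract_field(lines, field):
        out.write(f"{char}\t{value}\n")
        count += 1
    return count


# Tiny in-memory sample standing in for Unihan_Readings.txt.
sample = io.StringIO(
    "# comment line\n"
    "U+597D\tkDefinition\tgood, excellent, fine; well\n"
    "U+597D\tkMandarin\tHAO3 HAO4\n"
)
compact = io.StringIO()
written = write_compact(sample, compact)
```

With the real file you would pass an opened file object instead of the sample, for example open("Unihan_Readings.txt", encoding="utf-8"), and write to a file of your own.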

In it you'll find tab-delimited lines like this:

U+597D  kCantonese      hou2 hou3
U+597D  kDefinition     good, excellent, fine; well
U+597D  kHangul         호
U+597D  kHanyuPinlu     hao3(6060) hao1(142) hao4(115)
U+597D  kHanyuPinyin    21028.010:hǎo,hào
U+597D  kJapaneseKun    KONOMU SUKU YOI
U+597D  kJapaneseOn     KOU
U+597D  kKorean         HO
U+597D  kMandarin       HAO3 HAO4
U+597D  kTang           *xɑ̀u *xɑ̌u
U+597D  kVietnamese     háo
U+597D  kXHC1983        0445.030:hǎo 0448.030:hào

The left column ("U+597D") is the Unicode code point, the middle column is an attribute name, and the right column is the attribute value. You can extract either the kHanyuPinyin attributes or the kMandarin attributes. They encode basically the same information -- just go with whichever is an easier format for you to deal with. (hǎo == HAO3, hào == HAO4, if that isn't obvious.)
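To make the hǎo == HAO3 equivalence concrete, here is a minimal sketch that turns a tone-numbered syllable into its tone-marked form. It uses the usual pinyin placement rule (mark "a" or "e" if present, else the "o" of "ou", else the last vowel). The function name numbered_to_marked is made up, and treating "U:" or "V" as ü is an assumption about how the input spells that vowel.

```python
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # \u00fc is ü


def numbered_to_marked(syllable):
    """Convert e.g. 'HAO3' to 'hǎo'. Tone 5 or no digit means neutral tone."""
    s = unicodedata.normalize("NFC", syllable.lower())
    # Assumption: 'u:' and 'v' are ASCII spellings of ü.
    s = s.replace("u:", "\u00fc").replace("v", "\u00fc")
    if s and s[-1].isdigit():
        tone = int(s[-1])
        s = s[:-1]
    else:
        tone = 5
    if tone not in COMBINING:
        return s
    # Placement: 'a' or 'e' takes the mark; otherwise the 'o' in 'ou';
    # otherwise the last vowel in the syllable.
    if "a" in s:
        idx = s.index("a")
    elif "e" in s:
        idx = s.index("e")
    elif "ou" in s:
        idx = s.index("o")
    else:
        positions = [i for i, ch in enumerate(s) if ch in VOWELS]
        if not positions:
            return s  # nothing to carry the mark
        idx = positions[-1]
    marked = unicodedata.normalize("NFC", s[idx] + COMBINING[tone])
    return s[:idx] + marked + s[idx + 1:]
```

For example, numbered_to_marked("HAO3") and numbered_to_marked("HAO4") give the two readings shown in the kHanyuPinyin line above.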

You'll note that for some characters (like the example I've chosen here) there are multiple pronunciations. This is the one tricky bit. Depending on how much precision you want, you may be able to get away with just using the first romanization listed, as they're in order of decreasing frequency. (This is one of the places where kHanyuPinyin is a bit different from kMandarin -- it has multiple lists of pronunciations, each ordered by frequency.)
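The "just take the first romanization" shortcut might be sketched as follows. It assumes kMandarin values are space-separated readings and kHanyuPinyin values are space-separated lists of the form location:reading,reading, as in the sample lines above. The names first_reading and romanize are hypothetical, and the table holds only the one character from the example.

```python
def first_reading(field, value):
    """Return the first listed reading of a kMandarin or kHanyuPinyin value."""
    if field == "kMandarin":
        # Space-separated readings, e.g. "HAO3 HAO4".
        return value.split()[0]
    if field == "kHanyuPinyin":
        # Space-separated lists, each "location:reading,reading".
        first_list = value.split()[0]
        readings = first_list.split(":", 1)[1]
        return readings.split(",")[0]
    raise ValueError(f"unsupported field: {field}")


def romanize(text, table):
    """Map each character to its reading; unknown characters pass through."""
    return [table.get(ch, ch) for ch in text]


# One-entry table built from the sample line for U+597D.
table = {"\u597d": first_reading("kHanyuPinyin", "21028.010:hǎo,hào")}
tokens = romanize("\u597d!", table)
```

This ignores context entirely, so a character with several common readings will sometimes come out wrong; that is the precision trade-off described above.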
