I don't know whether the Google AJAX Language APIs support converting to pinyin, but if they don't, it actually isn't too hard to do a passable conversion on your own. (The reverse conversion, from pinyin to hanzi (characters), is much trickier, because pinyin is very lossy.)
To do the conversion yourself, grab Unihan.zip, a downloadable version of the Unihan database. The file you actually care about is Unihan_Readings.txt. It also contains a bunch of stuff you don't care about, and it's stored in a pretty inefficient way, so don't be too worried about the large file size. You should extract the parts you care about and store them in a more efficient way.
In it you'll find tab-delimited lines like this:
The left column ("U+597D") is the Unicode code point, the middle column is an attribute name, and the right column is the attribute value. You can extract either the kHanyuPinyin attributes or the kMandarin attributes. They encode basically the same information, so just go with whichever format is easier for you to deal with. (hǎo == HAO3 and hào == HAO4, if that isn't obvious.)
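Here's a rough sketch of the extraction step in Python, in case it helps. It keeps only the two attributes mentioned above and turns the "U+XXXX" column into the actual character. The function name `parse_readings` and the `SAMPLE` lines are my own inventions for illustration: the sample rows are hand-written in the three-column shape described above (the kHanyuPinyin location prefix is a placeholder), not copied from the real file.

```python
from collections import defaultdict

# Attributes worth keeping; everything else in the file is skipped.
WANTED = {"kMandarin", "kHanyuPinyin"}


def parse_readings(lines, wanted=WANTED):
    """Return {character: {attribute: raw value}} for the wanted attributes."""
    table = defaultdict(dict)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a codepoint / attribute / value row
        codepoint, attribute, value = parts
        if attribute not in wanted:
            continue
        char = chr(int(codepoint[2:], 16))  # "U+597D" -> the character itself
        table[char][attribute] = value
    return dict(table)


# Hand-written illustrative rows (assumption: not real database content).
SAMPLE = [
    "# a comment line",
    "U+597D\tkMandarin\tHAO3 HAO4",
    "U+597D\tkHanyuPinyin\t00000.000:hǎo,hào",
    "U+597D\tkDefinition\tgood, excellent, fine; well",
]

table = parse_readings(SAMPLE)
```

For the real file you would pass an open file object instead of `SAMPLE`, e.g. `parse_readings(open("Unihan_Readings.txt", encoding="utf-8"))`, and then dump the resulting dict into whatever compact storage you prefer.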
You'll note that some characters (like the example I've chosen here) have multiple pronunciations. This is the one tricky bit. Depending on how much precision you want, you may be able to get away with just using the first romanization listed, as they're in order of decreasing frequency. (Actually, this is one of the places where kHanyuPinyin is a bit different from kMandarin: it can have multiple lists of pronunciations, each ordered by frequency.)
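The "just take the first romanization" shortcut might look something like the sketch below. It assumes a lookup table shaped like `{character: {attribute: raw value}}`, which is my own choice of structure, and it assumes a kHanyuPinyin value looks like `locations:reading,reading` with space-separated entries. `TABLE`, `first_reading` and `to_pinyin` are hypothetical names, and the table entries are hand-written examples (the location prefix is a placeholder).

```python
# Tiny hand-written lookup table (assumption: illustrative values only).
TABLE = {
    "你": {"kMandarin": "NI3"},
    "好": {"kMandarin": "HAO3 HAO4", "kHanyuPinyin": "00000.000:hǎo,hào"},
}


def first_reading(attribute, value):
    """Pick the first listed reading from a raw attribute value."""
    first_entry = value.split()[0]  # entries are space-separated
    if attribute == "kHanyuPinyin":
        # Drop the "locations:" prefix, then take the first reading
        # of the first list.
        readings = first_entry.split(":", 1)[1]
        return readings.split(",")[0]
    return first_entry


def to_pinyin(text, table, attribute="kMandarin"):
    """Romanize character by character; unknown characters pass through."""
    out = []
    for ch in text:
        value = table.get(ch, {}).get(attribute)
        out.append(first_reading(attribute, value) if value else ch)
    return " ".join(out)


print(to_pinyin("你好", TABLE))                # NI3 HAO3
print(first_reading("kHanyuPinyin", TABLE["好"]["kHanyuPinyin"]))  # hǎo
```

This always picks the most frequent reading regardless of context, so it will be wrong whenever a character takes its less common pronunciation in a particular word.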
You can trick the API into giving you Pinyin by translating from Chinese to Chinese. Sample link.
Google Translate includes a "show/hide romanization" option, which is better than Unihan for two reasons. First, known words are grouped together properly (or at least it tries to do that). Second, many Chinese characters have more than one possible pronunciation, and figuring out which pinyin transliteration is the right one in context is not a trivial problem. That's what the translation engine does.