Converting Pinyin written with tone-mark accents to numbered form
I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).
Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?
I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.
I was able to write a translator that converts:
Nín hǎo. Wǒ shì zhōng guó rén
==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
But how do you handle words like the following? Do they get split up with a space into multiple words, or do you insert the tone numbers within the word (and if so, where)? Examples: huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.
Comments (2)
The problem with parsing pinyin without a space separating each syllable is that there will be ambiguity. Take, for instance, the name of an ancient Chinese capital, 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, this can be interpreted in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but there's no way your computer can do that.
Assuming no ambiguity, and that all input is valid, the way I would do it would look something like this:
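What follows is an illustrative Python sketch rather than the answerer's own code (Python instead of PHP, purely for brevity). It assumes every run of letters is a single syllable, as in the space-separated example from the question, and that the four tones are written with the macron, acute, caron and grave accents. The names `TONE_MARKS`, `syllable_to_numbered` and `to_numbered` are made up for this example.

```python
import re
import unicodedata

# Combining accents used for tones 1-4 (macron, acute, caron, grave).
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def syllable_to_numbered(syllable: str) -> str:
    """Turn one accented syllable, e.g. 'hǎo', into numbered form, e.g. 'hao3'."""
    tone = ""
    kept = []
    # NFD splits 'ǎ' into 'a' plus a combining caron, so the mark can be dropped.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC re-joins anything that is not a tone mark (e.g. the dots on 'ü').
    return unicodedata.normalize("NFC", "".join(kept)) + tone


def to_numbered(text: str) -> str:
    """Convert every run of letters, assumed to be exactly one syllable."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[^\W\d_]+", lambda m: syllable_to_numbered(m.group()), text)


print(to_numbered("Nín hǎo. Wǒ shì zhōng guó rén"))
# Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2
```

Unaccented (neutral-tone) syllables are left without a number. Because of the one-syllable-per-token assumption, a run-together word such as wèishénme is not handled correctly by this sketch.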
Anyway, the correct positioning of the numerical representation of the tones, and the correct numeral to represent each accent, are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.
Spacing should stay the same, but you got the numbering of the tones wrong.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.
wèishénme becomes wei4shen2me.
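To illustrate putting the number at the end of each syllable inside a longer word, here is a rough Python sketch. It is a heuristic, not a full pinyin parser. It assumes the input follows the usual spelling rule that a syllable starting with a, o or e inside a word is preceded by an apostrophe, so a final n, ng or r is attached to the current syllable only when no vowel follows it. The names `TONES`, `SYLLABLE_TAIL` and `number_tones` are made up for this example.

```python
import re
import unicodedata

# Combining accents used for tones 1-4 (macron, acute, caron, grave).
TONES = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}

# On NFD text: a tone mark, then the rest of that syllable, i.e. any further
# vowels (plus the combining dots of 'ü') and an optional n/ng/r ending that
# is not followed by a vowel (otherwise it starts the next syllable).
SYLLABLE_TAIL = re.compile(
    r"([\u0304\u0301\u030c\u0300])"
    r"([aeiou\u0308]*(?:(?:ng|n|r)(?![aeiou]))?)",
    re.IGNORECASE,
)


def number_tones(text: str) -> str:
    """Move each tone mark to a digit placed at the end of its syllable."""
    decomposed = unicodedata.normalize("NFD", text)
    moved = SYLLABLE_TAIL.sub(lambda m: m.group(2) + TONES[m.group(1)], decomposed)
    return unicodedata.normalize("NFC", moved)


for word in ["huā shíjiān", "wèishénme", "yuèláiyuè", "shēngbìng"]:
    print(word, "->", number_tones(word))
# huā shíjiān -> hua1 shi2jian1
# wèishénme -> wei4shen2me
# yuèláiyuè -> yue4lai2yue4
# shēngbìng -> sheng1bing4
```

Spacing and punctuation pass through untouched, and neutral-tone syllables such as the final "me" get no number. If the apostrophe has been stripped from something like Cháng'ān, this sketch reads it as Chan2gan1, which is the ambiguity described in the other comment.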