Converting Chinese Pinyin with tone marks (accents) to numbered form

Posted 2024-10-01 11:36:38

I'm looking to convert Pinyin where the tone marks are written with accents (e.g.: Nín hǎo) to Pinyin written in numerical/ASCII form (e.g.: Nin2 hao1).

Does anyone know of any libraries for this, preferably PHP? Or know Chinese/Pinyin well enough to comment?

I started writing one myself that was rather simple, but I don't speak Chinese and don't fully understand the rules of when words should be split up with a space.

I was able to write a translator that converts:

Nín hǎo. Wǒ shì zhōng guó rén ==> Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2

But how do you handle words like the following? Do they get split up with a space into multiple words, or do you insert the tone numbers within the word (and if so, where)?
huā shíjiān, wèishénme, yuèláiyuè, shēngbìng, etc.


Comments (2)

故事灯 2024-10-08 11:36:38

The problem with parsing pinyin that has no spaces between syllables is ambiguity. Take, for instance, the name of an ancient Chinese capital, 长安: Cháng'ān (notice the disambiguating apostrophe). If we strip out the apostrophe, however, it can be read in two ways: Chán gān or Cháng ān. A Chinese speaker would tell you that the second is far more likely, depending on the context of course, but your computer has no way of knowing that.
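
To make the ambiguity concrete, here is a minimal Python sketch that lists every way an unspaced, accent-free string can be split into syllables. The function name `segmentations` and the three-entry `SYLLABLES` set are made up for illustration; a real syllable table would have a few hundred entries.

```python
def segmentations(text, syllables):
    """Return every way to split `text` into pieces that are all in `syllables`."""
    if not text:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and try every split of what is left.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results

# Tiny illustrative subset (assumption), not a complete syllable inventory.
SYLLABLES = {"chan", "chang", "an", "gan"}

for split in segmentations("changan", SYLLABLES):
    print(" ".join(split))
```

With this subset both "chan gan" and "chang an" come back, and nothing in the string itself says which one was meant.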

Assuming no ambiguity, and that all input is valid, the way I would do it looks something like this:

  1. Create an accent-folding function
  2. Create an array of valid pinyin syllables (you can take it from the Wikipedia page for pinyin)
  3. Match each word against the list of valid pinyin
  4. When the last character of a word might instead belong to the next word, look ahead at the next word to decide, for example:
 shēngbìng
     ^ Does this 'g' belong to the next word?
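
As a rough sketch of step 1, accent folding can be done in Python with Unicode decomposition rather than a hand-written table. The names `TONE_MARKS` and `fold_syllable` are hypothetical, and the sketch assumes it is given a single syllable that carries at most one tone mark.

```python
import unicodedata

# Combining marks used for pinyin tones, mapped to tone numbers.
TONE_MARKS = {
    "\u0304": 1,  # macron      (ā)
    "\u0301": 2,  # acute       (á)
    "\u030c": 3,  # caron       (ǎ)
    "\u0300": 4,  # grave       (à)
}

def fold_syllable(syllable):
    """Return (plain_letters, tone) for one syllable; tone 0 means no mark was found."""
    tone = 0
    letters = []
    # NFD splits an accented letter into its base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so that "ü" (u + diaeresis) stays a single character.
    return unicodedata.normalize("NFC", "".join(letters)), tone

print(fold_syllable("hǎo"))
print(fold_syllable("lǚ"))
```

The diaeresis on ü is deliberately not treated as a tone mark, so lǚ folds to lü with tone 3 instead of losing its vowel quality.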
 

Anyway, the correct positioning of the tone numerals, and the correct numeral for each accent, are covered fairly well in this section of the Wikipedia article on pinyin: http://en.wikipedia.org/wiki/Pinyin#Numerals_in_place_of_tone_marks. You might also want to have a look at how IMEs do their job.

相守太难 2024-10-08 11:36:38

Spacing should stay the same, but your tone numbering is incorrect.
Nin2 hao3. Wo3 shi4 zhong1 guo2 ren2.

wèishénme becomes wei4shen2me.

  1. Remove diacritical marks by mapping "āáǎà" to "a", etc.
  2. Using a simple maximum-matching algorithm, split compounds into syllables (there are only 418 or so Mandarin syllables).
  3. Append numbers (you have to remember what kind of mark you removed) and join syllables back into compounds.
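
The three steps might be put together roughly as follows. This is a sketch in Python rather than PHP, and it rests on several assumptions: `SYLLABLES` is a small hand-picked subset instead of the full table, the matching is greedy longest-match with no backtracking, neutral-tone syllables get no digit, and all function names are invented for the example.

```python
import re
import unicodedata

# Step 1 data: each marked vowel maps to (plain vowel, tone number).
VOWEL_ROWS = {"a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
              "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ"}
MARKED = {ch: (base, tone)
          for base, row in VOWEL_ROWS.items()
          for tone, ch in enumerate(row, start=1)}

# Tiny illustrative subset (assumption); a real table has roughly 400 entries.
SYLLABLES = {"nin", "hao", "wo", "shi", "zhong", "guo", "ren", "hua",
             "jian", "wei", "shen", "sheng", "me", "yue", "lai", "bing"}
MAX_LEN = max(len(s) for s in SYLLABLES)

def strip_marks(word):
    """Return (plain, tones): plain letters plus a per-character tone (0 = unmarked)."""
    plain, tones = [], []
    for ch in word:
        base, tone = MARKED.get(ch.lower(), (ch.lower(), 0))
        plain.append(base.upper() if ch.isupper() else base)
        tones.append(tone)
    return "".join(plain), tones

def split_syllables(plain):
    """Greedy longest-match split into (start, end) spans; None if something is unknown."""
    lower = plain.lower()
    spans, i = [], 0
    while i < len(lower):
        for size in range(min(MAX_LEN, len(lower) - i), 0, -1):
            if lower[i:i + size] in SYLLABLES:
                spans.append((i, i + size))
                i += size
                break
        else:
            return None  # no known syllable starts at position i
    return spans

def to_numbered(text):
    """Convert accented pinyin to numbered pinyin, keeping spacing and punctuation."""
    def convert(match):
        word = match.group(0)
        plain, tones = strip_marks(word)
        spans = split_syllables(plain)
        if spans is None:
            return word  # leave words we cannot segment untouched
        out = []
        for start, end in spans:
            tone = max(tones[start:end])  # assumes at most one marked vowel per syllable
            out.append(plain[start:end] + (str(tone) if tone else ""))
        return "".join(out)
    # [^\W\d_]+ matches runs of letters, including precomposed accented ones.
    return re.sub(r"[^\W\d_]+", convert, unicodedata.normalize("NFC", text))

print(to_numbered("Nín hǎo. Wǒ shì zhōng guó rén"))
print(to_numbered("huā shíjiān, wèishénme, yuèláiyuè, shēngbìng"))
```

Because the match is greedy, a string such as the apostrophe-less form of Cháng'ān would simply take the longest first syllable; handling real ambiguity would need backtracking or a dictionary of words.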