Converting Chinese characters to Hanyu Pinyin

Posted 2024-10-06 04:08:48

How can I convert Chinese characters to Hanyu Pinyin?

E.g.

你 --> Nǐ

马 --> Mǎ


More Info:

Either the accented or the numeric form of Hanyu Pinyin is acceptable; the numeric form is my preference.

A Java library is preferred, but a library in another language that can be put in a wrapper is also OK.

I would like anyone who has personally used such a library to recommend or comment on it in terms of its quality and reliability.

谎言月老 2024-10-13 04:08:48

The problem of converting hanzi to pinyin is a fairly difficult one. Many hanzi have multiple pinyin readings, depending on context. Compare 长大 (pinyin: zhǎng dà) with 长城 (pinyin: cháng chéng). For this reason, single-character conversion is often useless in practice, unless you have a system that outputs multiple possibilities. There is also the issue of word segmentation, which can affect the pinyin representation as well. Perhaps you already knew this, but I thought it was important to say.
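To make the context problem concrete, here is a minimal sketch of word-level lookup with a single-character fallback. Both tables and the function name `annotate` are hypothetical and tiny, made up for this illustration; a real system would need a full dictionary and a proper segmenter.

```python
# Toy illustration only: both tables are hypothetical and deliberately tiny.
WORD_READINGS = {
    "长大": ["zhang3", "da4"],
    "长城": ["chang2", "cheng2"],
}
CHAR_READINGS = {
    "长": ["chang2", "zhang3"],
    "大": ["da4"],
    "城": ["cheng2"],
}


def annotate(text, max_word_len=4):
    """Greedy longest match: prefer a known word, else fall back to single characters."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible word first, down to two characters.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                result.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No word matched: emit every candidate reading, joined by "/".
            char = text[i]
            result.append("/".join(CHAR_READINGS.get(char, [char])))
            i += 1
    return " ".join(result)


print(annotate("长大"))  # zhang3 da4
print(annotate("长城"))  # chang2 cheng2
print(annotate("长"))    # chang2/zhang3
```

With a word in the table the reading is unambiguous; a lone 长 can only be reported as a list of candidates.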

That said, the Adso Package contains both a segmenter and a probabilistic pinyin annotator, based on the excellent Adso library. It takes a while to get used to though, and may be much larger than you are looking for (I have found in the past that it was a bit too bulky for my needs). Additionally, there doesn't appear to be a public API anywhere, and its C++ ...

For a recent project, because I was working with place names, I simply used the Google Translate API (specifically, the unofficial Java port), which, for common nouns at least, usually does a good job of translating to pinyin. The problem is commonly used alternative transliteration systems, such as "HongKong" for what should be "XiangGang". Given all of this, Google Translate is pretty limited, but it offers a start.

I hadn't heard of pinyin4j before, but after playing with it just now, I found it less than optimal: while it outputs a list of candidate pinyin romanizations, it makes no attempt to statistically determine their likelihood. There is a method that returns a single representation, but it will soon be phased out, as it currently only returns the first romanization, not the most likely one. Where the program seems to do well is in conversion between romanization systems and in general configurability.

In short, the answer may be any one of these, depending on what you need. Idiosyncratic proper nouns? Google Translate. In need of statistics? Adso. Willing to accept candidate lists without context information? Pinyin4j.

爱冒险 2024-10-13 04:08:48

In Python, try:

from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('C')
cjk.getReadingForCharacter(u'北', 'Pinyin')

You would get

['běi', 'bèi']

Disclaimer: I'm the author of that library.
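Since a per-character lookup like this returns every reading, a caller ends up with several candidates for each character. A minimal sketch of enumerating the combinations for a short string follows; the table `READINGS` and the function `candidate_readings` are hypothetical names for this illustration and are not part of cjklib.

```python
from itertools import product

# Hypothetical per-character reading table for illustration; in practice the
# lists would come from a library lookup such as the call shown above.
READINGS = {
    "北": ["běi", "bèi"],
    "长": ["cháng", "zhǎng"],
    "城": ["chéng"],
}


def candidate_readings(text):
    """Return every combination of per-character readings, without ranking them."""
    per_char = [READINGS.get(ch, [ch]) for ch in text]
    return [" ".join(combo) for combo in product(*per_char)]


print(candidate_readings("长城"))  # ['cháng chéng', 'zhǎng chéng']
```

The number of combinations multiplies with every ambiguous character, so this only stays manageable for short strings; picking the right candidate still needs context.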

顾北清歌寒 2024-10-13 04:08:48

For Java, I'd try the pinyin4j library.

叫思念不要吵 2024-10-13 04:08:48

As mentioned in other answers, the conversion is fuzzy, and even Google Translate apparently gets a certain percentage of character combinations wrong.

Open-source libraries available for some programming languages give reasonable results, though not 100% accurate ones.

The simplest way to do the conversion in Python is with the pypinyin library (install it with pip3 install pypinyin):

from pypinyin import pinyin


def to_pinyin(chin):
    return ' '.join([seg[0] for seg in pinyin(chin)])


print(to_pinyin('好久不见'))
# OUTPUT: hǎo jiǔ bú jiàn

NOTE: The pinyin method from the module returns a list of possible candidate segments, and the to_pinyin function takes the first variant whenever more than one conversion is available. For tricky corner cases this is likely to produce incorrect results, but in general you'll probably get a success rate of roughly 90-95% or better.

There are a few other Python libraries for pinyin conversion, but in my tests they had a higher error rate than pypinyin. They also don't appear to be actively maintained.

If you need better accuracy then you'll need a more complex approach that will rely on bigger datasets and possibly some machine learning.
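The question prefers the numeric form, while the output above uses tone marks. Here is a minimal sketch of converting one into the other using only the standard library; the names `TONE_MARKS`, `syllable_to_numeric` and `to_numeric` are made up for this example, and it assumes space-separated syllables and leaves neutral-tone syllables without a digit.

```python
import unicodedata

# Combining marks (as they appear after NFD decomposition) mapped to tone numbers.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute accent
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave accent
}


def syllable_to_numeric(syllable):
    """Convert one tone-marked syllable, e.g. 'hǎo', to numeric form, e.g. 'hao3'."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    # Recompose so that 'ü' (u + diaeresis) stays a single character.
    return unicodedata.normalize("NFC", "".join(letters)) + tone


def to_numeric(text):
    """Convert space-separated tone-marked pinyin to numeric form."""
    return " ".join(syllable_to_numeric(s) for s in text.split())


print(to_numeric("hǎo jiǔ bú jiàn"))  # hao3 jiu3 bu2 jian4
```

This only changes the notation; it does nothing about choosing the right reading in the first place.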
