Detecting whether a character is Simplified or Traditional Chinese

Posted on 2024-10-10 17:19:35

I found this question, which lets me check whether a string contains a Chinese character. I'm not sure whether the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.
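For reference, a range check of the kind described above might look like the following Python sketch. The ranges (CJK Unified Ideographs and Extension A) and the names `is_han` and `contains_han` are assumptions for illustration; they are not taken from the linked question. Note that Japanese kanji and Korean hanja share these code points, so a check like this returns false only for text written purely in kana or hangul.

```python
# Assumed ranges for illustration: the main CJK Unified Ideographs block
# and Extension A. Other extension blocks are deliberately left out.
HAN_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
)


def is_han(ch: str) -> bool:
    """Return True if the single character falls in one of HAN_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def contains_han(text: str) -> bool:
    """Return True if any character of the text is a Han ideograph."""
    return any(is_han(ch) for ch in text)
```

This says nothing about Simplified versus Traditional: both forms live in the same blocks.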

What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?


Update

Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

http://unicode.org/faq/han_cjk.html

Their argument is that characters have the same meaning regardless of their shape and should therefore be represented by the same code. Well, the distinction is not meaningless to me, because I am analyzing individual characters, and that doesn't work with their solution:

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
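The whole-text approach quoted above could be sketched roughly as follows in Python. The function names, the block ranges chosen, and the 10% threshold for "a fair amount" are all assumptions made for this example; the FAQ does not specify any of them.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Bucket a character into kana, hangul, han, or other (assumed ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF:
        return "kana"    # Hiragana and Katakana blocks
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"  # Hangul Syllables and Hangul Jamo
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "han"     # CJK Unified Ideographs and Extension A
    return "other"


def guess_cjk_language(text: str, threshold: float = 0.1) -> str:
    """Guess the language from the share of kana or hangul among CJK characters.

    The threshold is an arbitrary assumption standing in for "a fair amount".
    """
    counts = Counter(script_of(ch) for ch in text)
    cjk_total = counts["kana"] + counts["hangul"] + counts["han"]
    if cjk_total == 0:
        return "unknown"
    if counts["kana"] / cjk_total >= threshold:
        return "japanese"
    if counts["hangul"] / cjk_total >= threshold:
        return "korean"
    return "chinese"
```

For example, `guess_cjk_language("これは日本語の文章です")` returns `"japanese"`, while a sentence made only of Han characters falls through to `"chinese"`. A short Japanese phrase written entirely in kanji would be misclassified, which is the limitation the FAQ is pointing at.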


Comments (3)

鸵鸟症 2024-10-17 17:19:36

As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.

夜声 2024-10-17 17:19:35

As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and "Simplified Chinese Unicode table" for a general discussion.
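To illustrate the idea of judging a longer sample rather than a single character, here is a minimal Python sketch. It is not the gem's implementation. The two character sets are tiny, hand-picked samples, and the names `SIMPLIFIED_ONLY`, `TRADITIONAL_ONLY`, `guess_han_style` and the `min_hits` cutoff are assumptions for this example; a real detector would need a much larger table.

```python
# Tiny hand-picked samples for illustration only; each simplified form
# below pairs with the traditional form at the same position.
SIMPLIFIED_ONLY = set("们这说时国会对过书马见长门问")
TRADITIONAL_ONLY = set("們這說時國會對過書馬見長門問")


def guess_han_style(text: str, min_hits: int = 3) -> str:
    """Guess the writing style by counting style-specific characters.

    Returns "undetermined" when fewer than min_hits distinguishing
    characters are found, since a short sample gives too little evidence.
    """
    simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
    trad = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simp + trad < min_hits:
        return "undetermined"
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    return "mixed"
```

With these sample sets, `guess_han_style("这是我们的书")` returns `"simplified"` and `guess_han_style("這是我們的書")` returns `"traditional"`, while a two-character input such as `"我们"` stays `"undetermined"`.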

〃安静 2024-10-17 17:19:35

It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:

  1. Characters that are traditional only.
  2. Characters that are simplified only.
  3. Characters that have been left untouched, and are available in both.

Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.

But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...

On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
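The problem can be seen in a small Python sketch. The sample lines below follow the tab-separated layout of Unihan_Variants.txt, but they are hand-written to mirror only the four relationships described above; they are not copied from the database, and current Unihan data may list more values. The names `parse_variants` and `naive_classify` are hypothetical.

```python
# Hand-written sample in the "U+XXXX<TAB>field<TAB>value" layout, covering
# only the relationships described in the answer (an assumption, not real data).
SAMPLE_VARIANTS = (
    "U+9EB5\tkSimplifiedVariant\tU+9762\n"   # 麵 -> 面
    "U+9762\tkTraditionalVariant\tU+9EB5\n"  # 面 -> 麵
    "U+97D3\tkSimplifiedVariant\tU+97E9\n"   # 韓 -> 韩
    "U+97E9\tkTraditionalVariant\tU+97D3\n"  # 韩 -> 韓
)


def parse_variants(data: str):
    """Collect characters that carry each of the two variant fields."""
    has_simplified, has_traditional = set(), set()
    for line in data.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, _value = line.split("\t", 2)
        ch = chr(int(code[2:], 16))  # "U+9EB5" -> "麵"
        if field == "kSimplifiedVariant":
            has_simplified.add(ch)
        elif field == "kTraditionalVariant":
            has_traditional.add(ch)
    return has_simplified, has_traditional


def naive_classify(ch: str, has_simplified: set, has_traditional: set) -> str:
    """Classify using only the presence of the variant fields."""
    s, t = ch in has_simplified, ch in has_traditional
    if s and t:
        return "ambiguous"
    if s:
        return "traditional"  # it has a simplified counterpart
    if t:
        return "simplified"   # it has a traditional counterpart
    return "unaffected"


simp_set, trad_set = parse_variants(SAMPLE_VARIANTS)
labels = {ch: naive_classify(ch, simp_set, trad_set) for ch in "麵面韓韩人"}
```

Here 面 and 韩 both come out as `"simplified"`, even though 面 is also a perfectly valid traditional character. The variant fields alone give the heuristic no way to tell the two cases apart.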
