Simplified Chinese Unicode table
Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE:
I have found that there is another encoding called GB 2312 (http://en.wikipedia.org/wiki/GB_2312), which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file, which maps GB2312 to Unicode (http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt), but I'm not sure whether it's accurate.
If that table isn't correct, maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2:
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as their Unicode equivalents: http://www.herongyang.com/gb2312/
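For anyone who just wants the table, one option is to let an existing codec do the conversion. The sketch below is in Python, not the Java program mentioned above, and it assumes that Python's built-in gb2312 codec is an acceptable source for the mapping. It walks the hanzi rows of GB 2312 (rows 16 through 87, in EUC-CN byte form) and pairs each character with its Unicode code point. The name gb2312_hanzi_table is made up for this example.

```python
def gb2312_hanzi_table():
    """Return (gb_hex, unicode_hex, char) rows for the hanzi area of GB 2312."""
    rows = []
    for row in range(16, 88):        # hanzi occupy rows 16..87
        for cell in range(1, 95):    # each row has up to 94 cells
            raw = bytes([0xA0 + row, 0xA0 + cell])  # EUC-CN byte pair
            try:
                char = raw.decode("gb2312")
            except UnicodeDecodeError:
                continue             # unassigned cell, skip it
            rows.append((raw.hex().upper(), f"U+{ord(char):04X}", char))
    return rows


table = gb2312_hanzi_table()
for gb, uni, char in table[:3]:
    print(gb, uni, char)
```

Keep in mind that GB 2312 also contains characters that are written the same way in both scripts, so this gives "characters usable in simplified Chinese text", not "characters that exist only in simplified form".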
The Unihan database contains this information in the file Unihan_Variants.txt. For example, U+6A5F is 機, the traditional form of 机 (U+673A).
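As a rough illustration of how that file could be read, here is a minimal Python sketch. It assumes the usual Unihan layout of tab-separated lines (code point, field name, value) and looks only at the kSimplifiedVariant and kTraditionalVariant fields. The two SAMPLE lines are written by hand in that layout for the 機/机 pair, and parse_variants is a name invented for this example.

```python
def parse_variants(lines):
    """Build traditional->simplified and simplified->traditional maps."""
    to_simplified, to_traditional = {}, {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        source, field, value = line.split("\t", 2)
        if field not in ("kSimplifiedVariant", "kTraditionalVariant"):
            continue                      # ignore the other variant fields
        src = chr(int(source[2:], 16))    # "U+6A5F" -> "機"
        targets = [chr(int(cp[2:], 16)) for cp in value.split()]
        if field == "kSimplifiedVariant":
            to_simplified[src] = targets
        else:
            to_traditional[src] = targets
    return to_simplified, to_traditional


# Hand-written sample lines in the assumed layout, not a copy of the file.
SAMPLE = [
    "U+673A\tkTraditionalVariant\tU+6A5F",
    "U+6A5F\tkSimplifiedVariant\tU+673A",
]

to_simplified, to_traditional = parse_variants(SAMPLE)
print(to_simplified)
print(to_traditional)
```

With the real file you would pass an open file object instead of SAMPLE. The keys of to_traditional are then candidates for the simplified set, although some characters map to themselves and need extra care.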
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). In each entry, the first column is in traditional characters and the second column is in simplified characters.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
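The collection step could look roughly like the Python sketch below. It assumes the CC-CEDICT line layout "traditional simplified [pinyin] /gloss/" with "#" comment lines. The SAMPLE entries are hand-written stand-ins in that layout, not real dictionary data, and simplified_chars is a made-up name.

```python
def simplified_chars(lines):
    """Collect every CJK character seen in the second (simplified) column."""
    chars = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip blanks and header comments
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue                       # not an entry line
        simplified = parts[1]
        # Keep only characters in the main CJK block; entries can also
        # contain Latin letters or punctuation.
        chars.update(ch for ch in simplified if "\u4e00" <= ch <= "\u9fff")
    return chars


# Stand-in entries in the assumed layout.
SAMPLE = [
    "# sample only",
    "機 机 [ji1] /machine/",
    "飛機 飞机 [fei1 ji1] /airplane/",
]

print(sorted(simplified_chars(SAMPLE)))
```

In this sample 飞 shows up only inside a compound, which is exactly the case the note above warns about. The result also includes characters that are identical in both scripts, since they appear in the second column too.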
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional: https://github.com/jpatokal/script_detector
Sample:
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo, 東京.
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
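To make the failure mode concrete, here is a small Python sketch of the general idea (counting exclusively simplified versus exclusively traditional characters). It is not the Ruby library's API or its actual algorithm. The two character sets are tiny hand-picked samples standing in for the distilled Unihan data, and guess_script is a hypothetical name.

```python
# Tiny hand-picked stand-ins for the sets derived from Unihan_Variants.txt.
SIMPLIFIED_ONLY = {"机", "东", "国", "汉"}
TRADITIONAL_ONLY = {"機", "東", "國", "漢"}


def guess_script(text):
    """Guess the script by counting characters exclusive to each set."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "unknown"                # no evidence either way


print(guess_script("汉字"))   # simplified
print(guess_script("東京"))   # traditional, even though the word is Japanese
```

A two-character string gives the counter almost nothing to work with, which is why longer fragments are needed before the guess means much.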
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list uses ranges rather than listing each individual character, and that the characters are literal UTF-8, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars in ['一-龥'] and try to build a new list. Or run two regexes: one to check that a character is Chinese, and one to check that it is not simplified Chinese.
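The two-regex idea might look something like this in Python. The SIMPLIFIED pattern here is only a three-character placeholder, since the real pattern lives in the pastebin and is not reproduced; non_simplified is a made-up helper name.

```python
import re

# Any character in the 一-龥 range (U+4E00 through U+9FA5).
CHINESE = re.compile(r"[一-龥]")

# Placeholder only: swap in the full pattern from the pastebin.
SIMPLIFIED = re.compile(r"[机东国]")


def non_simplified(text):
    """Return characters that are Chinese but not matched as simplified."""
    return [
        ch for ch in text
        if CHINESE.fullmatch(ch) and not SIMPLIFIED.fullmatch(ch)
    ]


print(non_simplified("机機a东東"))
```

With the placeholder pattern this keeps 機 and 東 and drops the simplified forms and the Latin letter. How well it works in practice depends entirely on how complete the simplified pattern is.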
According to Wikipedia, the choice between simplified Chinese, traditional Chinese, kanji, or other forms is in many cases left up to font rendering. So while you could put together a selection of simplified Chinese code points, the list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.