What is the minimal Unicode character set for reasonable Japanese support?

Posted 2024-07-15 19:58:33

I have a mobile application that needs to be ported for a Japanese audience. Part of the application is a custom font file that needs to be extended from containing only Latin-1 characters to also containing Japanese characters. I realise that this will make it rather large, but that is not today's problem.

Note that I have no control over the text to be displayed by this application, so it needs to be able to support enough to be able to display user-generated content.

Here is what I believe to be a maximal set of Unicode ranges that would cover anything required of it.

 Compatibility                         U+3300  -  U+33FF
 Compatibility forms                   U+FE30  -  U+FE4F
 Compatibility ideographs              U+F900  -  U+FAFF
 Compatibility ideographs supplement  U+2F800  - U+2FA1F
 Radicals supplement                   U+2E80  -  U+2EFF
 Strokes                               U+31C0  -  U+31EF
 Symbols and punctuation               U+3000  -  U+303F
 Unified Ideographs                    U+4E00  -  U+9FBB
 Unified Ideographs ext. A             U+3400  -  U+4DB5
 Unified Ideographs ext. B            U+20000  - U+2A6D6
 Enclosed letters and months           U+3200  -  U+32FF
 Hiragana                              U+3040  -  U+309F
 Kanbun                                U+3190  -  U+319F
 Katakana                              U+30A0  -  U+30FF
 Katakana phonetic                     U+31F0  -  U+31FF

What I need to know is:

  • Is anything missing from this list?
  • Is anything obviously not required?
  • Is anything arguably non-essential, and why could it be argued as such?

Comments (2)

束缚m 2024-07-22 19:58:33

Summary of Essential Characters

Enclosed Alphanumerics                U+2460  -  U+2473
            "                         U+2474  -  U+24E9*
            "                         U+24EA  -  U+24FF
Miscellaneous Symbols                 U+2600  -  U+2607
            "                         U+260E  -  U+260F
            "                         U+2614  -  U+2615
            "                         U+2618  -  U+2618
            "                         U+263D  -  U+2653
            "                         U+2660  -  U+266F
Symbols and punctuation               U+3000  -  U+303F
Hiragana                              U+3040  -  U+309F
Katakana                              U+30A0  -  U+30FF
Katakana phonetic                     U+31F0  -  U+31FF
Enclosed letters and months           U+321F  -  U+325F*
            "                         U+3280  -  U+32FF*
Unified Ideographs ext. A             U+3400  -  U+4DB5
Unified Ideographs                    U+4E00  -  U+9FBB
Compatibility ideographs              U+F900  -  U+FAFF
Compatibility forms                   U+FE30  -  U+FE4F
Full-Width Roman                      U+FF00  -  U+FF5E
Half-Width Katakana                   U+FF61  -  U+FF9F
Full- and Half-Width Symbols          U+FFE0  -  U+FFEE
Unified Ideographs ext. B            U+20000  - U+2A6D6
Compatibility ideographs supplement  U+2F800  - U+2FA1F

* = Lower priority
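
As a rough way to check sample text against the table above, the ranges can be written out as data and scanned. This is only a sketch: the names `ESSENTIAL_RANGES`, `is_covered`, and `uncovered` are made up for illustration, the lower-priority rows are included, and Latin-1 is left out on the assumption that the existing font already covers it.

```python
# Ranges copied from the summary table above (inclusive code points).
# ESSENTIAL_RANGES, is_covered and uncovered are hypothetical names.
ESSENTIAL_RANGES = [
    (0x2460, 0x24FF),    # Enclosed Alphanumerics (all three rows)
    (0x2600, 0x2607),    # Miscellaneous Symbols: weather
    (0x260E, 0x260F),    # telephones
    (0x2614, 0x2615),    # umbrella, hot drink
    (0x2618, 0x2618),    # shamrock
    (0x263D, 0x2653),    # astrological and zodiac
    (0x2660, 0x266F),    # card suits, hot springs, music
    (0x3000, 0x303F),    # Symbols and punctuation
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x31F0, 0x31FF),    # Katakana phonetic
    (0x321F, 0x325F),    # Enclosed letters and months (lower priority)
    (0x3280, 0x32FF),    # Enclosed letters and months (lower priority)
    (0x3400, 0x4DB5),    # Unified Ideographs ext. A
    (0x4E00, 0x9FBB),    # Unified Ideographs
    (0xF900, 0xFAFF),    # Compatibility ideographs
    (0xFE30, 0xFE4F),    # Compatibility forms
    (0xFF00, 0xFF5E),    # Full-width Roman
    (0xFF61, 0xFF9F),    # Half-width Katakana
    (0xFFE0, 0xFFEE),    # Full- and half-width symbols
    (0x20000, 0x2A6D6),  # Unified Ideographs ext. B
    (0x2F800, 0x2FA1F),  # Compatibility ideographs supplement
]


def is_covered(ch, ranges=ESSENTIAL_RANGES):
    """Return True if the single character ch falls inside any range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def uncovered(text, ranges=ESSENTIAL_RANGES):
    """Return distinct characters of text outside every range, in first-seen order."""
    missing = []
    for ch in text:
        if not is_covered(ch, ranges) and ch not in missing:
            missing.append(ch)
    return missing
```

Running `uncovered` over a sample of real user-generated text is one way to see which characters the table would leave without a glyph.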

Full Explanation

Don't forget the full-width Roman characters (FF00-FF5E), which are often used for the Roman alphabet in Japanese text, and the half-width Katakana (FF61-FF9F). You will probably also need the full- and half-width symbols (FFE0-FFEE).
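
The full-width forms U+FF01 to U+FF5E mirror printable ASCII U+0021 to U+007E in the same order, at a fixed offset of 0xFEE0. The sketch below uses that to fold full-width text back to ASCII, which might serve as a fallback if the full-width glyphs are not in the font yet. The function name `fold_fullwidth` is hypothetical, and mapping the ideographic space U+3000 to a plain space is an extra assumption of this example.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E minus U+0021..U+007E


def fold_fullwidth(text):
    """Replace full-width ASCII variants with their ASCII counterparts."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            out.append(chr(cp - FULLWIDTH_OFFSET))
        elif cp == 0x3000:
            # Ideographic space: folded to a plain space (assumption).
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```

Folding changes the width of the rendered text, so it is a stopgap rather than a substitute for real full-width glyphs.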

An argument can be made that the Kanbun annotation block (3190-319F) will generally not be used. Kanbun is an old style of Japanese writing that uses only Chinese characters (no Hiragana or Katakana) with a different set of grammar rules, and it is generally taught at school. These annotation marks will not be used unless someone is trying to explain how to read or understand one of these passages, which is probably unlikely. The block could be included for completeness, but it is probably not a high priority.

CJK Compatibility (3300-33FF) is generally used by newspapers in print media, but will almost certainly not be used by the general public (I have yet to see one on a website). In any event, these characters have equivalent long forms (e.g. ㌘ can be written as グラム instead), so this block is also in the non-essential category.
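
Those long forms are recorded in Unicode as compatibility decompositions, so a font without the 3300-33FF block could expand such characters before rendering. A minimal sketch, assuming NFKC normalization is acceptable for just this block; the function name `expand_squared_words` is made up here, and only Python's standard `unicodedata` module is used.

```python
import unicodedata


def expand_squared_words(text):
    """Expand CJK Compatibility characters (U+3300..U+33FF) to their long forms.

    Only this block is normalized, so full-width Roman and half-width
    Katakana elsewhere in the text are left untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x3300 <= ord(ch) <= 0x33FF else ch
        for ch in text
    )
```

For example, `expand_squared_words("㌘")` gives `"グラム"`, matching the long form mentioned above.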

CJK Radicals Supplement (2E80-2EFF) is also non-essential, but could be used. These are not complete characters, but the "radicals" (base parts) of characters. They could be used to explain the derivation of a character, but are unlikely to appear in normal use of the language.

CJK Strokes (31C0-31E3) falls into the same category as the CJK Radicals Supplement, and is probably even less likely to be used in everyday text.

The first part of Enclosed CJK Letters and Months (3200-321E) is unnecessary, since those are Korean symbols; the same goes for 3260-327F. The rest of the block sees little use, but I would include it for completeness because someone will probably try to use one occasionally. You can treat it as lower priority.

Everything else in your original list is essential.

Also missing from the list is Enclosed Alphanumerics (2460-24FF). The circled numbers (2460-2473 and 24EA-24FF) are used relatively frequently. The parenthesized numbers, numbers followed by a period, and parenthesized and circled letters (2474-24E9) could be omitted as non-essential, however.

Also, you would do well to include Miscellaneous Symbols (2600-26FF), although some are used more often than others. Absolutely essential ones include some of the weather symbols (2600-2607), the telephones (260E-260F), umbrella and hot drink (2614-2615), the shamrock (2618), astrological and zodiac symbols (263D-2653), and the playing card suits, hot springs, and musical symbols (2660-266F).

命比纸薄 2024-07-22 19:58:33

Technically speaking, you should include:
1. Arabic numerals (0,1..9)
2. English punctuation (!"#$%'...)
3. Roman letters (A..Z, a..z) (half-width and full-width)

1-3 basically means ASCII support.

4. Hiragana
5. Katakana
6. Japanese punctuation
7. Joyo Kanji (This is the list of about 2000 Kanji approved by the Japanese government for use in newspapers, etc.)
8. Name Kanji (Another list compiled by the Japanese government for proper names).

All together, that gives you something like 2,600 kanji, enough to represent most ordinary text you could find on the web. There are some minor exceptions where characters are common but not in Joyo (e.g. 沢).

The problem is that Unicode isn't exactly organized around the Joyo kanji list, so you would have to pick and choose within the ranges. It's probably just easier to include every kanji that exists in Japanese, even if it isn't frequently used or part of Joyo.
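
If you do go the pick-and-choose route, one approach is to start from a plain string of the characters you want and collapse their code points into contiguous ranges. The sketch below assumes you supply that string yourself, for example the Joyo and name kanji lists pasted into a file. The names `to_ranges` and `format_ranges` are hypothetical, and the `U+XXXX-YYYY` output is just one common notation, not the input format of any particular tool.

```python
def to_ranges(chars):
    """Collapse the code points of chars into sorted inclusive (lo, hi) ranges."""
    ranges = []
    for cp in sorted({ord(c) for c in chars}):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [(lo, hi) for lo, hi in ranges]


def format_ranges(ranges):
    """Render ranges as a comma-separated string such as U+3041-3043,U+6F22."""
    return ",".join(
        f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-{hi:04X}"
        for lo, hi in ranges
    )
```

Because kanji from these lists are scattered across the Unified Ideographs block, expect many short ranges rather than a few long ones.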
