In Java, should the int value of a char be called its ASCII value or its Unicode value?

Posted 2025-01-02 13:29:38

I am working on a program in Java that only deals with capital letters. During some processing, I use the int values of the chars for these capital letters. I understand that the values of the capital letters are the same in Unicode and ASCII, but when referring to these int values, should I say that they are the Unicode values or the ASCII values? I just want to make sure that I'm using the correct terminology for the language.


Comments (3)

假装不在乎 2025-01-09 13:29:38

It should be referred to as a Unicode code unit. A Java char is a 16-bit UTF-16 code unit, as opposed to a full Unicode code point, which Java represents as a 32-bit int (it was originally thought that Unicode would fit in 16 bits). A char always takes 16 bits, regardless of what the value is. ASCII is 7-bit (8 if you count the zero padding or error-checking bit). Thus, the term "ASCII value" doesn't fully apply even if the actual value is the same.
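To make the code unit versus code point distinction concrete, here is a minimal sketch using only standard `java.lang.Character` and `String` methods. The class name `CodeUnitDemo` and the helper `codeUnitsFor` are hypothetical names chosen for illustration.

```java
public class CodeUnitDemo {
    // Hypothetical helper: number of UTF-16 code units (Java chars)
    // needed to represent the given code point.
    static int codeUnitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        char a = 'A';
        int value = a; // widening conversion from char to int, no cast needed
        System.out.println(value);          // 65
        System.out.println(Character.SIZE); // 16 (bits in a char)

        // 'A' fits in one code unit; U+1F47D needs a surrogate pair.
        System.out.println(codeUnitsFor('A'));     // 1
        System.out.println(codeUnitsFor(0x1F47D)); // 2

        String alien = new String(Character.toChars(0x1F47D));
        System.out.println(alien.length());                          // 2 code units
        System.out.println(alien.codePointCount(0, alien.length())); // 1 code point
    }
}
```

For the capital letters A to Z, the code unit, the code point, and the ASCII value all happen to be the same number, which is why the terminology question only becomes visible outside that range.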

绅士风度i 2025-01-09 13:29:38

If the characters will only ever be ASCII, you can refer to them as ASCII. Otherwise, you should use the term Unicode which, as you state, is a proper superset of ASCII. Keep in mind that, even though you refer to them as ASCII, the encoding may need to be changed if you're sending them to something that expects real (octet-based) ASCII.

If your software only handles code points in the ASCII range (and see below, this is not usually a good idea), it's much easier to say (to users, or in the documentation) "ASCII values" than "Unicode values in the ASCII range" :-)

It's actually misleading to refer to your values as Unicode code points in the context of doing things to uppercase letters, if you only handle the uppercase letters in the ASCII range.

Any new software nowadays should be written with Unicode in mind, and that includes the fact that uppercase letters are not restricted to the ASCII range.

For example, there's a chunk of Greek characters nowhere near the ASCII range that have upper and lowercase properties. The SpecialCasing.txt file shows these properties and there's also a FAQ on the subject.
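As a rough illustration of that last point, the sketch below compares an ASCII-only range check with the Unicode-aware methods on `java.lang.Character`. The class `UppercaseDemo` and the helper `isAsciiUpper` are made-up names for this example, not part of any library.

```java
import java.util.Locale;

public class UppercaseDemo {
    // Hypothetical ASCII-only check: true only for 'A' through 'Z'.
    static boolean isAsciiUpper(int codePoint) {
        return codePoint >= 'A' && codePoint <= 'Z';
    }

    public static void main(String[] args) {
        int sigma = 0x03A3; // GREEK CAPITAL LETTER SIGMA

        System.out.println(isAsciiUpper('Q'));            // true
        System.out.println(isAsciiUpper(sigma));          // false
        System.out.println(Character.isUpperCase(sigma)); // true

        // Simple one-to-one case mapping: U+03A3 lowercases to U+03C3.
        System.out.println(Integer.toHexString(Character.toLowerCase(sigma))); // 3c3

        // A one-to-many mapping of the kind listed in SpecialCasing.txt:
        // LATIN SMALL LETTER SHARP S uppercases to two letters.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // SS
    }
}
```

A program that documents its values as "ASCII values" would reasonably use something like `isAsciiUpper`; one that claims to handle Unicode uppercase letters would need the `Character` methods instead.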

横笛休吹塞上声 2025-01-09 13:29:38

The correct and proper term according to the Unicode Glossary for the numeric code is its code point. For example:

  • The code point for DIGIT ONE is 31 in hexadecimal (49 in decimal), normally written U+0031.
  • The code point for POUND SIGN is U+00A3.
  • The code point for LATIN SMALL LETTER I WITH DIAERESIS is U+00EF.
  • The code point for GREEK SMALL LETTER MU is U+03BC.
  • The code point for LATIN SMALL LETTER F WITH DOT ABOVE is U+1E1F.
  • The code point for REPLACEMENT CHARACTER is U+FFFD.
  • The code point for MUSICAL SYMBOL DOUBLE FLAT is U+1D12B.
  • The code point for MATHEMATICAL ITALIC CAPITAL R is U+1D445.
  • The code point for EXTRATERRESTRIAL ALIEN is U+1F47D.
  • U+100002 is an assigned code point in the Supplementary_Private_Use_Area_B block.
  • The assigned name of code point U+0041 is LATIN CAPITAL LETTER A.
  • The assigned name of code point U+1F47E is ALIEN MONSTER.
  • Code point U+0FFE is unassigned, and so has no name.

And so on and so forth.
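A few of the entries in the list above can be checked from Java itself. This is only a sketch: `CodePointDemo` and `uPlus` are hypothetical names, and the name returned by `Character.getName` depends on the Unicode version bundled with the JDK in use.

```java
public class CodePointDemo {
    // Hypothetical helper: formats a code point in the usual U+XXXX notation
    // (at least four hex digits, uppercase).
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // DIGIT ONE: 31 hexadecimal, 49 decimal.
        System.out.println((int) '1');  // 49
        System.out.println(uPlus('1')); // U+0031

        // EXTRATERRESTRIAL ALIEN lies outside the 16-bit range, so in a Java
        // String it is stored as a surrogate pair of two chars.
        String alien = "\uD83D\uDC7D";
        System.out.println(uPlus(alien.codePointAt(0))); // U+1F47D

        // Looking up the assigned name of a code point.
        System.out.println(Character.getName(0x0041)); // LATIN CAPITAL LETTER A

        // U+100002 falls in a private-use block.
        System.out.println(Character.getType(0x100002) == Character.PRIVATE_USE); // true
    }
}
```

Note that `codePointAt` returns an int rather than a char, which is the practical reason "code point" and "char value" are not interchangeable terms in Java.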
