In Java, should the int value of a char be called its ASCII value or its Unicode value?
I am working on a Java program that deals only with capital letters. During some processing, I use the int values of the chars for these capital letters. I understand that the values of the capital letters are the same in Unicode and ASCII, but when referring to these int values, should I call them Unicode values or ASCII values? I just want to make sure that I am using the correct terminology for the language.
Comments (3)
It should be referred to as a Unicode code unit. A Java char is a 16-bit Unicode code unit, as opposed to a full Unicode code point, which Java holds in a 32-bit int (it was originally thought that Unicode would be 16-bit). A char always takes 16 bits, regardless of what the value is. ASCII is 7-bit (8 if you count the zero padding or error-checking bit). Thus, the term ASCII doesn't fully apply even if the actual value is the same.
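The distinction between code units and code points can be seen with a small sketch. The class and method names here (CodeUnitDemo, codeUnits, codePoints) are made up for illustration and are not from the answer; only the standard String and Character calls are real API.

```java
final class CodeUnitDemo {
    /** Number of 16-bit UTF-16 code units (Java chars) in the text. */
    static int codeUnits(String text) {
        return text.length();
    }

    /** Number of Unicode code points in the text. */
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        char a = 'A';
        int value = a; // widening conversion from char to int
        System.out.println(value); // 65, the same number ASCII assigns to 'A'

        // U+1F47D lies above U+FFFF, so it needs two chars (a surrogate pair).
        String alien = new String(Character.toChars(0x1F47D));
        System.out.println(codeUnits(alien));  // 2
        System.out.println(codePoints(alien)); // 1
    }
}
```

For capital letters in the ASCII range, the code unit, the code point, and the ASCII value are all the same number, which is why the terminology question only becomes visible with characters outside that range.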
If the characters will only ever be ASCII, you can refer to them as ASCII. Otherwise, you should use the term Unicode which, as you state, is a proper superset of ASCII. Keep in mind that, even if you refer to them as ASCII, the encoding may need to change if you send them to something that expects real (octet-based) ASCII.
If your software only handles code points in the ASCII range (and see below, this is not usually a good idea), it's much easier to say (to users, or in the documentation) "ASCII values" than "Unicode values in the ASCII range" :-)
It's actually misleading to refer to your values as Unicode code points in the context of doing things to uppercase letters if you only handle the uppercase letters in the ASCII range.
Any new software nowadays should be written with Unicode in mind, and that includes the fact that uppercase letters are not restricted to the ASCII range.
For example, there's a chunk of Greek characters nowhere near the ASCII range that have uppercase and lowercase properties. The SpecialCasing.txt file shows these properties, and there's also a FAQ on the subject.
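As a rough illustration of why an ASCII-only check is narrower than "uppercase letter", the sketch below compares a hand-written range test with the JDK's Unicode-aware methods. UppercaseDemo and isAsciiUpper are hypothetical names chosen for this example; Character.isUpperCase, Character.toLowerCase and String.toUpperCase are standard API.

```java
import java.util.Locale;

final class UppercaseDemo {
    /** True only for 'A'..'Z', the uppercase letters in the ASCII range. */
    static boolean isAsciiUpper(int codePoint) {
        return codePoint >= 'A' && codePoint <= 'Z';
    }

    public static void main(String[] args) {
        int sigma = 0x03A3; // GREEK CAPITAL LETTER SIGMA

        System.out.println(isAsciiUpper('Q'));                       // true
        System.out.println(isAsciiUpper(sigma));                     // false
        System.out.println(Character.isUpperCase(sigma));            // true
        System.out.println(Character.toLowerCase(sigma) == 0x03C3);  // true

        // Some case mappings change the length of the string:
        // LATIN SMALL LETTER SHARP S (U+00DF) uppercases to "SS".
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));       // SS
    }
}
```

If the program really is limited to A through Z, a check like isAsciiUpper documents that limit explicitly, and calling the values "ASCII values" matches what the code does.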
The correct and proper term according to the Unicode Glossary for the numeric code is its code point. For example:
DIGIT ONE is 31 in hexadecimal (49 in decimal), normally written U+0031.
POUND SIGN is U+00A3.
LATIN SMALL LETTER I WITH DIAERESIS is U+00EF.
GREEK SMALL LETTER MU is U+03BC.
LATIN SMALL LETTER F WITH DOT ABOVE is U+1E1F.
REPLACEMENT CHARACTER is U+FFFD.
MUSICAL SYMBOL DOUBLE FLAT is U+1D12B.
MATHEMATICAL ITALIC CAPITAL R is U+1D445.
EXTRATERRESTRIAL ALIEN is U+1F47D.
[...] Supplementary_Private_Use_Area_B block.
[...] LATIN CAPITAL LETTER A.
[...] ALIEN MONSTER.
And so on and so forth.
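The U+XXXX notation used above can be produced from Java with a few standard calls. This is only a sketch: CodePointDemo and uPlus are hypothetical names for this example, and the names printed by Character.getName depend on the Unicode version bundled with the JDK in use.

```java
final class CodePointDemo {
    /** Formats a code point in the conventional U+XXXX notation (at least four hex digits). */
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // DIGIT ONE, POUND SIGN, GREEK SMALL LETTER MU, EXTRATERRESTRIAL ALIEN
        String text = "1\u00A3\u03BC" + new String(Character.toChars(0x1F47D));

        // codePoints() iterates by code point, so a surrogate pair counts once.
        text.codePoints().forEach(cp ->
                System.out.println(uPlus(cp) + " " + Character.getName(cp)));
    }
}
```

Iterating with codePoints() rather than over individual chars is what keeps a character such as U+1F47D from being reported as two separate surrogate values.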