Why are 8 and 256 such important numbers in computer science?
I don't know much about RAM and HDD architecture, or how electronics deal with chunks of memory, but this has always triggered my curiosity:
Why did we choose to stop at 8 bits for the smallest element of a computer value?
My question may look very dumb because the answer seems obvious, but I'm not so sure...
Is it because 2^3 lets it fit neatly when addressing memory?
Is the electronics specifically designed to store chunks of 8 bits? If so, why not use wider words?
Is it because 8 divides 32, 64 and 128, so that a processor word can hold several of those units?
Is it just convenient to have 256 values in such a tiny space?
What do you think?
My question is a little metaphysical, but I want to make sure it's just a historical reason and not a technological or mathematical one.
As an aside, I was also thinking about the ASCII standard, in which most of the first characters are useless with things like UTF-8; I'm also trying to think about a tinier, faster character encoding...
10 Answers
Historically, bytes haven't always been 8 bits in size (for that matter, computers don't have to be binary either, but non-binary computing has seen much less action in practice). It is for this reason that IETF and ISO standards often use the term octet: they don't use byte because they don't want to assume it means 8 bits when it doesn't.
Indeed, when byte was coined it was defined as a 1-to-6-bit unit. Byte sizes in use throughout history include 7, 9 and 36 bits, as well as machines with variable-sized bytes.
8 won out through a mixture of commercial success, its being a convenient enough number for the people thinking about it (the two would have fed into each other), and no doubt other reasons I'm completely ignorant of.
The ASCII standard you mention assumes a 7-bit byte and was based on earlier 6-bit communication standards.
Edit: It may be worth adding to this, as some are insisting that those saying bytes are always octets are confusing bytes with words.
An octet is the name given to a unit of 8 bits (from the Latin for eight). If you are using a computer (or, at a higher abstraction level, a programming language) where bytes are 8 bits, then working with octets is easy; otherwise you need some conversion code (or conversion in hardware). The concept of octet comes up more in networking standards than in local computing because, being architecture-neutral, it allows the creation of standards that can be used for communication between machines with different byte sizes; hence its use in IETF and ISO standards. (Incidentally, ISO/IEC 10646 uses octet where the Unicode Standard uses byte for what is essentially, with some minor extra restrictions on the latter part, the same standard, though the Unicode Standard does spell out that by byte it means octet, even though bytes may be different sizes on different machines.) The concept of octet exists precisely because 8-bit bytes are common (hence the choice to use them as the basis of such standards) but not universal (hence the need for another word to avoid ambiguity).
Historically, a byte was the size used to store a character, a matter which in turn builds on practices, standards and de facto standards for telex and other communication methods that pre-date computers, starting perhaps with Baudot in 1870 (I don't know of anything earlier, but am open to corrections).
This is reflected by the fact that in C and C++ the unit for storing a byte is called char, whose size in bits is defined by CHAR_BIT in the standard limits.h header. Different machines would use 5, 6, 7, 8, 9 or more bits to define a character. These days, of course, we define characters as 21-bit and use different encodings to store them in 8-, 16- or 32-bit units (and non-Unicode-authorised ways like UTF-7 for other sizes), but historically that was the way it was.
In languages which aim to be more consistent across machines, rather than reflecting the machine architecture, byte tends to be fixed in the language, and these days this generally means it is defined in the language as 8-bit. Given the point in history when they were made, and that most machines now have 8-bit bytes, the distinction is largely moot, though it's not impossible to implement a compiler, runtime, etc. for such languages on machines with different-sized bytes; it's just not as easy.
A word is the "natural" size for a given computer. This is less clearly defined, because it covers a few overlapping concerns that would generally coincide, but might not. Most registers on a machine will be this size, but some might not be. The largest address size would typically be a word, though this may not be the case (the Z80 had an 8-bit byte and a 1-byte word, but allowed some doubling of registers to give some 16-bit support, including 16-bit addressing).
Again we see here a difference between C and C++, where int is defined in terms of word size and long is defined to take advantage of a processor with a "long word" concept, should such exist, though possibly being identical in a given case to int. The minimum and maximum values are again in the limits.h header. (Indeed, as time has gone on, int may be defined as smaller than the natural word size, as a combination of consistency with what is common elsewhere, reduction in memory usage for arrays of ints, and probably other concerns I don't know of.)
Java and .NET languages take the approach of defining int and long as fixed across all architectures, making dealing with the differences an issue for the runtime (particularly the JITter) to handle. Notably though, even in .NET the size of a pointer (in unsafe code) will vary by architecture to match the underlying word size, rather than being a language-imposed word size.
Hence, octet, byte and word are all very independent of each other, despite the relationship of octet == byte, and word being a whole number of bytes (and a whole binary-round number like 2, 4, 8, etc.), being common today.
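A minimal C sketch of the machinery this answer refers to: the char/CHAR_BIT definitions from limits.h and the word-size-dependent int and long ranges. It assumes a hosted C99 compiler; the printed numbers vary by platform.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* CHAR_BIT is the number of bits in this machine's byte;
       the C standard requires at least 8 but permits more. */
    printf("bits per byte (CHAR_BIT): %d\n", CHAR_BIT);

    /* sizeof counts bytes (chars), not octets. */
    printf("sizeof(int):  %zu bytes\n", sizeof(int));
    printf("sizeof(long): %zu bytes\n", sizeof(long));

    /* The ranges in limits.h follow from those sizes. */
    printf("INT_MAX:  %d\n", INT_MAX);
    printf("LONG_MAX: %ld\n", LONG_MAX);
    return 0;
}
```

On a typical x86-64 Linux box this prints CHAR_BIT as 8 and sizeof(long) as 8 bytes, but none of that is guaranteed by the language, which is exactly the answer's point.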
Not all bytes are 8 bits. Some are 7, some 9, some other values entirely. The reason 8 is important is that, in most modern computers, it is the standard number of bits in a byte. As Nikola mentioned, a bit is the actual smallest unit (a single binary value, true or false).
As Will mentioned, this article http://en.wikipedia.org/wiki/Byte describes the byte and the history of its variable sizes in some more detail.
The general reasoning behind why 8, 256, and other numbers are important is that they are powers of 2, and computers run using a base-2 (binary) system of switches.
ASCII encoding required 7 bits, and EBCDIC required 8 bits. Extended ASCII codes (such as ANSI character sets) used the 8th bit to expand the character set with graphics, accented characters and other symbols.
Some architectures made use of proprietary encodings; a good example of this is the DEC PDP-10, which had a 36-bit machine word. Some operating systems on this architecture used packed encodings that stored 6 characters in a machine word for various purposes, such as file names.
By the 1970s, the success of the D.G. Nova and DEC PDP-11 (both 16-bit architectures) and of IBM mainframes with 32-bit machine words was pushing the industry towards an 8-bit character by default. The 8-bit microprocessors of the late 1970s were developed in this environment, and this became a de facto standard, particularly as off-the-shelf peripheral chips such as UARTs, ROM chips and FDC chips were being built as 8-bit devices.
By the latter part of the 1970s the industry had settled on 8 bits as a de facto standard, and architectures such as the PDP-8, with its 12-bit machine word, became somewhat marginalised (although the PDP-8 ISA and derivatives still appear in embedded system products). 16- and 32-bit microprocessor designs such as the Intel 80x86 and MC68K families followed.
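As a sketch of the kind of packing described above: six 6-bit character codes fit exactly into a 36-bit word, which we can model in the low bits of a 64-bit integer. This is only an illustration of the idea, not DEC's actual SIXBIT layout.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack six 6-bit codes into the low 36 bits of a 64-bit integer,
   most significant character first (hypothetical layout). */
static uint64_t pack6(const uint8_t codes[6]) {
    uint64_t word = 0;
    for (int i = 0; i < 6; i++)
        word = (word << 6) | (codes[i] & 0x3F); /* keep 6 bits per code */
    return word;
}

/* Extract character i (0..5) back out of the packed word. */
static uint8_t unpack6(uint64_t word, int i) {
    return (uint8_t)((word >> (6 * (5 - i))) & 0x3F);
}

int main(void) {
    uint8_t codes[6] = {6, 9, 12, 5, 1, 0};  /* arbitrary 6-bit values */
    uint64_t w = pack6(codes);
    printf("packed 36-bit word: %09llx\n", (unsigned long long)w);
    for (int i = 0; i < 6; i++)
        printf("code %d: %u\n", i, unpack6(w, i));
    return 0;
}
```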
Since computers work with binary numbers, all powers of two are important.
8-bit numbers are able to represent 256 (2^8) distinct values, enough for all the characters of English and quite a few extra ones. That made the numbers 8 and 256 quite important.
The fact that many CPUs used to (and still do) process data in 8-bit units helped a lot.
Other important powers of two you might have heard about are 1024 (2^10 = 1k) and 65536 (2^16 = 64k).
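A small C demonstration of that 256-value count, assuming the common case of CHAR_BIT == 8: an unsigned char holds exactly 2^8 distinct values and wraps around past 255.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* UCHAR_MAX is 255 where bytes are 8 bits wide,
       giving 2^8 = 256 distinct values. */
    printf("UCHAR_MAX = %u, i.e. %u distinct values\n",
           (unsigned)UCHAR_MAX, (unsigned)UCHAR_MAX + 1);

    unsigned char c = UCHAR_MAX;
    c = (unsigned char)(c + 1); /* wraps around modulo 256 */
    printf("255 + 1 wraps to %u\n", (unsigned)c);
    return 0;
}
```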
Computers are built upon digital electronics, and digital electronics works with states. A fragment can have 2 states, 1 or 0 (if the voltage is above some level then it is 1, if not then it is 0). To represent that behavior, the binary system was adopted.
So we come to the bit. The bit is the smallest fragment in the binary system. It can take only 2 states, 1 or 0, and it represents the atomic fragment of the whole system.
To make our lives easier, the byte (8 bits) was introduced. To give you an analogy: we don't express weight in grams, even though that is the base measure of weight; we use kilograms, because they are easier to use and to understand. One kilogram is 1000 grams, which can be expressed as 10 to the power of 3. So when we go back to the binary system and use a similar power, we get 8 (2 to the power of 3 is 8). That was done because using only bits was overly complicated in everyday computing.
That pattern held, so further on, when we realized that 8 bits was again too small and becoming complicated to use, we added 1 to the power (2 to the power of 4 is 16), then again 2^5 is 32, and so on; 256 is just 2 to the power of 8.
So the answer is: we follow the binary system because of the architecture of computers, and we go up in powers of two to get sizes we can handle conveniently every day. That is how you get from a bit to a byte (8 bits) and so on!
(2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and so on) (2^x, x = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and so on)
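That "add 1 to the power" step is literally a one-bit left shift in binary; a quick C sketch:

```c
#include <stdio.h>

int main(void) {
    /* Each left shift by one multiplies by 2, raising the power by 1:
       1 << 3 == 8 (the bits in a byte), 1 << 8 == 256, and so on. */
    for (int power = 1; power <= 10; power++)
        printf("2^%-2d = %u\n", power, 1u << power);
    return 0;
}
```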
The important number here is binary 0 or 1. All your other questions are related to this.
Claude Shannon and George Boole did the fundamental work on what we now call information theory and Boolean arithmetic. In short, this is the basis of how a digital switch, with only the ability to represent 0 (OFF) and 1 (ON), can represent more complex information such as numbers, logic and a jpg photo. Binary is the basis of computers as we know them currently, but computers in other number bases, or analog computers, are completely possible.
In human decimal arithmetic, the powers of ten have significance: 10, 100, 1000 and 10,000 each seem important and useful. Once you have a computer based on binary, the powers of 2 likewise become important. 2^8 = 256 is enough for an alphabet, punctuation and control characters. (More precisely, 2^7 is enough for an alphabet, punctuation and control characters, and 2^8 leaves room for those ASCII characters plus a check bit.)
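A sketch of that check-bit idea in C, assuming the even-parity convention (one of several used in practice): the 7 low bits carry the ASCII code, and the 8th bit is set so the octet has an even number of 1 bits.

```c
#include <stdio.h>

/* Return c's 7 ASCII bits with an even-parity bit in bit 7. */
static unsigned char add_parity(unsigned char c) {
    unsigned char ones = 0;
    for (int i = 0; i < 7; i++)
        ones ^= (unsigned char)((c >> i) & 1); /* XOR of the 7 data bits */
    return (unsigned char)((c & 0x7F) | (ones << 7));
}

int main(void) {
    /* 'C' is 0x43 with three 1 bits, so the parity bit comes out set. */
    printf("'C' with parity bit: 0x%02X\n", add_parity('C'));
    /* 'A' is 0x41 with two 1 bits, so it passes through unchanged. */
    printf("'A' with parity bit: 0x%02X\n", add_parity('A'));
    return 0;
}
```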
We normally count in base 10, where a single digit can have one of ten different values. Computer technology is based on (microscopic) switches which can be either on or off. If one of these represents a digit, that digit can be either 1 or 0. This is base 2.
It follows from there that computers work with numbers that are built up as a series of 2-value digits.
When processors are designed, they have to pick a size that the processor will be optimized to work with. To the CPU, this is considered a "word". Earlier CPUs were based on word sizes of four bits, and soon after 8 bits (1 byte). Today, CPUs are mostly designed to operate on 32-bit and 64-bit words. But really, the two-state "switch" is why all computer numbers tend to be powers of 2.
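One rough way to observe the word size from C is pointer width, which on common platforms tracks the native word (a heuristic sketch only; the C standard ties none of this to the hardware word size):

```c
#include <stdio.h>

int main(void) {
    /* Typically prints 4 on 32-bit platforms and 8 on 64-bit ones. */
    printf("sizeof(void *) = %zu bytes\n", sizeof(void *));
    printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));
    return 0;
}
```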
I believe the main reason has to do with the original design of the IBM PC. The Intel 8080 CPU was the first precursor to the 8086, which would later be used in the IBM PC. It had 8-bit registers, so a whole ecosystem of applications was developed around the 8-bit metaphor. In order to retain backward compatibility, Intel designed all subsequent architectures to keep 8-bit registers. Thus, the 8086 and all x86 CPUs after it kept their 8-bit registers for backwards compatibility, even though they added new 16-bit and 32-bit registers over the years.
The other reason I can think of is that 8 bits is perfect for fitting a basic Latin character set. You cannot fit it into 4 bits, but you can in 8, which gives you the whole 256-value extended ASCII charset. It is also the smallest power of 2 with enough bits to fit a character set. Of course, these days most character sets are actually 16 bits wide or more (i.e. Unicode).
Charles Petzold wrote an interesting book called Code that covers exactly this question. See chapter 15, Bytes and Hex.
Historical reasons, I suppose. 8 is a power of 2; 4 bits (2^4 = 16 values) is far too little for most purposes, and 16-bit hardware (the next power of two up) came much later.
But the main reason, I suspect, is the fact that they had 8-bit microprocessors, then 16-bit microprocessors whose words could very well be represented as 2 octets, and so on. You know, historical cruft, backward compatibility, etc.
Another, similarly pragmatic reason against "scaling down": if we used, say, 4 bits as one word, we would basically get only half the throughput compared with 8 bits, aside from overflowing much faster.
You can always squeeze, e.g., 2 numbers in the range 0..15 into one octet... you just have to extract them by hand. But unless you have, like, gazillions of data sets to keep in memory side by side, this isn't worth the effort.
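That "by hand" extraction is just shifts and masks; a minimal C sketch for two 0..15 values per octet:

```c
#include <stdint.h>
#include <stdio.h>

/* Squeeze two values in the range 0..15 into one octet:
   hi in the upper nibble, lo in the lower nibble. */
static uint8_t pack_nibbles(uint8_t hi, uint8_t lo) {
    return (uint8_t)(((hi & 0x0F) << 4) | (lo & 0x0F));
}

int main(void) {
    uint8_t b = pack_nibbles(9, 14);
    /* Extract them again "by hand" with a shift and a mask. */
    printf("packed: 0x%02X, hi = %u, lo = %u\n",
           b, (unsigned)(b >> 4), (unsigned)(b & 0x0F));
    return 0;
}
```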