PHP: what does the description of iconv_strlen() mean?
I was wondering what the following sentence means in simple terms, for us dummies.
And what is a byte sequence? And how many characters are there in a byte?
iconv_strlen() counts the occurrences of characters in the given byte sequence str on the basis of the specified character set, the result of which is not necessarily identical to the length of the string in byte.
Comments (4)
Let's take for example the Japanese character 'こ'. Assuming UTF-8 encoding, this is a 3-byte character (0xE3 0x81 0x93). Let's see what happens when we call strlen on it: the result is 3, since strlen counts bytes. However, this is only a single character according to the UTF-8 encoding. That's where iconv_strlen comes in. It knows that in UTF-8 this is a single character, even though it is made up of 3 bytes. So if we call iconv_strlen with UTF-8 as the character set instead, we get 1. That's what that explanation is meant to point out.
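The code snippets from this answer did not survive on this page. As a minimal sketch of the comparison it describes (the variable name $char is only illustrative, and the character is written as byte escapes so the result does not depend on how the source file is saved):

```php
<?php
// 'こ' encoded in UTF-8 is the three bytes 0xE3 0x81 0x93.
$char = "\xE3\x81\x93";

// strlen() counts bytes.
var_dump(strlen($char));                 // int(3)

// iconv_strlen() decodes the bytes as UTF-8 and counts characters.
var_dump(iconv_strlen($char, 'UTF-8'));  // int(1)
```

This assumes the iconv extension is available, which is the case in a default PHP build.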
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
A string has a particular length in bytes. The number of characters in that string will be equal to the number of bytes if and only if each character in the string is represented by a single byte. This is true, for example, for English letters. For representations (i.e., encodings) that use more than one byte to represent some or all characters, the number of characters will be smaller than the number of bytes*. It is not possible, for example, to represent all possible Chinese characters with a single byte.
So, iconv_strlen, given an encoding, will try to count the number of characters in the string. The byte sequence is simply the bytes of the string, in order. For a string containing Chinese, using UTF-8 encoding, you might, for example, have a 20-byte string that has 14 characters.
*It could be more, if a character is represented by less than one byte.
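To make the 20-byte, 14-character case concrete, here is a small sketch. The sample string is an assumption chosen to hit exactly those numbers, not one taken from the answer: 11 ASCII characters followed by the 3 Chinese characters 你好吗, written as UTF-8 byte escapes.

```php
<?php
// 11 ASCII characters (1 byte each) ...
$ascii = "hello world";
// ... plus 3 Chinese characters, 你 好 吗 (3 bytes each in UTF-8).
$chinese = "\xE4\xBD\xA0\xE5\xA5\xBD\xE5\x90\x97";

$text = $ascii . $chinese;

var_dump(strlen($text));                 // int(20): 11 + 3 * 3 bytes
var_dump(iconv_strlen($text, 'UTF-8'));  // int(14): 11 + 3 characters
```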
Translations:
byte sequence: another word for string, which is a sequence of bytes (1 byte = 8 bits), e.g. 01011010 00011001 01101011. Byte sequences represent characters like A, B, C, etc.
character set: a.k.a. encoding; it specifies how bytes map to characters, e.g. 01000001 represents A in the ASCII character set.
not necessarily identical to the length […] in byte: in the ASCII character set, one byte represents exactly one character. This is not the case for all character sets; in some, two, three or more bytes are used to represent one character. That is because one byte can only hold 256 different values, and some languages are written using more than 256 characters (like Chinese and Japanese). Unicode even attempts to map all characters of all human languages into a single character set, which requires more than one byte for most characters.
In summary: the number of characters in a string is not necessarily the same as its number of bytes.
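The definitions above can be tried out directly. The following is a minimal sketch, not part of the original answer; the variable names are illustrative, and ISO-8859-1 is used here only as an example of a character set with one byte per character.

```php
<?php
// The byte 01000001 is the number 65, which ASCII maps to the letter A.
$letter = chr(bindec('01000001'));
var_dump($letter);                             // string(1) "A"

// The same three bytes, counted under two different character sets.
$bytes = "\xE3\x81\x93";
var_dump(iconv_strlen($bytes, 'UTF-8'));       // int(1): one 3-byte character
var_dump(iconv_strlen($bytes, 'ISO-8859-1'));  // int(3): one character per byte
```

The byte sequence itself never changes; only the character set passed as the second argument decides how many characters it is counted as.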