Checking for whitespace in a Unicode string, byte by byte!
Quick & dirty Q: Can I safely assume that no byte in the encoding of a UTF-8, UTF-16 or UTF-32 code point (character) will be an ASCII whitespace byte (unless the code point itself represents one)?
I'll explain:
Say that I have a UTF-8 encoded string. This string contains some characters that take more than one byte to store. I need to find out if any of the characters in this string are ASCII whitespace characters (space, horizontal tab, vertical tab, carriage return, linefeed etc - Unicode defines some more whitespace characters, but forget about them).
So what I do is that I loop through the string and check if any of the bytes match the bytes that define whitespace characters. Take e.g. 0D (hex) for carriage return. Note that we are talking bytes here, not characters.
Will this work? Will there be UTF-8 codepoints where the first byte will be 0D and the second byte something else - and this codepoint does not represent a carriage return? Maybe the other way around? Will there be codepoints where the first byte is something weird, and the second (or third, or fourth) byte is 0D - and this codepoint does not represent a carriage return?
UTF-8 is backwards compatible with ASCII, so I really hope that it will work for UTF-8. From what I know of it, it might, but I don't know the details well enough to say for sure.
As for UTF-16 and UTF-32 I doubt it'll work at all, but I barely know anything about the details of these, so feel free to surprise me there...
The reason for this wacky question is that I have code that checks for whitespace and works for ASCII, and I need to know if it may break on Unicode. I have no choice but to check byte-for-byte, for a bunch of reasons. I'm hoping that the backwards compatibility with ASCII might give me at least UTF-8 support for free.
Comments (4)
For UTF-8, yes, you can. All non-ASCII characters are represented by bytes with the high bit set, and all ASCII characters have the high bit unset.
Just to be clear, every byte in the encoding of a non-ASCII character has the high bit set; this is by design.
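As a minimal sketch of the byte-wise check the question describes (the name has_ascii_whitespace and the exact set of six whitespace bytes are assumptions for illustration, not taken from the asker's code), the following scans UTF-8 data without decoding it and confirms on a few sample characters that every byte of a non-ASCII character is 0x80 or above:

```python
# ASCII whitespace bytes: space, tab, line feed, vertical tab, form feed, carriage return.
ASCII_WHITESPACE = frozenset(b" \t\n\v\f\r")


def has_ascii_whitespace(data: bytes) -> bool:
    """Return True if any byte of the UTF-8 data is an ASCII whitespace byte."""
    return any(byte in ASCII_WHITESPACE for byte in data)


# Every byte of a non-ASCII character has the high bit set (>= 0x80),
# so none of them can collide with a whitespace byte.
for ch in "é€😀":
    assert all(byte >= 0x80 for byte in ch.encode("utf-8"))

print(has_ascii_whitespace("naïve€".encode("utf-8")))   # False
print(has_ascii_whitespace("naïve €".encode("utf-8")))  # True
```

The scan is only trustworthy when the input really is valid UTF-8; it says nothing about the additional whitespace characters Unicode defines outside ASCII.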
You should never operate on UTF-16 or UTF-32 at the byte level. This almost certainly won't work. In fact lots of things will break, since every second byte is likely to be '\0' (unless you typically work with text in a language other than English).
In correctly encoded UTF-8, all ASCII characters will be encoded as one byte each, and the numeric value of each byte will be equal to the Unicode and ASCII code points. Furthermore, any non-ASCII character will be encoded using only bytes that have the eighth bit set. Therefore, a byte value of 0D will always represent a carriage return, never any byte of a multibyte UTF-8 sequence.
However, sometimes the UTF-8 decoding rules are abused to store ASCII characters in other ways. For example, if you take the two-byte sequence C0 A0 and UTF-8-decode it leniently, you get the one-byte value 20, which is a space. (Any time you find the byte C0 or C1, it's the first byte of an overlong two-byte encoding of an ASCII character; such sequences are not valid UTF-8, and a strict decoder rejects them.) I've seen this done to encode strings that were originally assumed to be single words, but later requirements grew to allow the value to have spaces. In order to not break existing code (which used stuff like strtok and sscanf to recognize space-delimited fields), the value was encoded using this bastardized UTF-8 instead of real UTF-8.
You probably don't need to worry about that, though. If the input to your program uses that format, then your code probably isn't meant to detect the specially encoded whitespace at that point anyway, so it's safe for you to ignore it.
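To make the overlong-encoding trick concrete, here is a small sketch. The helper lenient_decode_two_byte is hypothetical and exists only to show the bit arithmetic; it is not how a conforming decoder behaves.

```python
def lenient_decode_two_byte(lead: int, cont: int) -> int:
    """Apply the two-byte UTF-8 bit layout without rejecting overlong forms.

    Hypothetical helper for illustration; a conforming decoder must reject
    lead bytes C0 and C1.
    """
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


overlong_space = bytes([0xC0, 0xA0])

# A lenient decoder recovers U+0020 (space) from the overlong form.
print(hex(lenient_decode_two_byte(*overlong_space)))  # 0x20

# A byte-wise scan sees no 0x20 byte, so the hidden space goes unnoticed.
print(0x20 in overlong_space)  # False

# Python's built-in strict decoder refuses the sequence.
try:
    overlong_space.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

If the input might contain such sequences, validating it as strict UTF-8 before scanning is one way to surface the problem early.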
Yes, but see caveat below about the pitfalls of processing non-byte-oriented streams in this way.
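For UTF-8, any continuation bytes always start with the bits 10, making them greater than 0x7f, so there's no chance they could be mistaken for an ASCII space.
You can see this in the following table:
Code point range      Byte 1    Byte 2    Byte 3    Byte 4
U+0000 - U+007F       0xxxxxxx
U+0080 - U+07FF       110xxxxx  10xxxxxx
U+0800 - U+FFFF       1110xxxx  10xxxxxx  10xxxxxx
U+10000 - U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx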
You can also see that the non-continuation bytes for code points outside the ASCII range also have the high bit set, so they can never be mistaken for a space either.
See the Wikipedia article on UTF-8 for more detail.
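The bit patterns described above can be checked directly. The sketch below uses an illustrative function name, classify_utf8_byte, and labels each byte of a few encoded characters by its leading bits. It deliberately ignores the fact that some byte values never occur in valid UTF-8.

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte by its leading bits, following the UTF-8 layout.

    Simplified for illustration: bytes that never occur in valid UTF-8
    (such as 0xF8-0xFF) are still reported as "lead".
    """
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx, start of a multibyte sequence


for ch in "A é€😀":
    encoded = ch.encode("utf-8")
    kinds = [classify_utf8_byte(b) for b in encoded]
    print(f"U+{ord(ch):04X}", encoded.hex(" "), kinds)
```

Only the bytes classified as "ascii" can ever equal a whitespace byte, which is why the byte-wise scan is safe for UTF-8.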
UTF-16 and UTF-32 shouldn't be processed byte-by-byte in the first place. You should always process the unit itself, either a 16-bit or 32-bit value. If you do that, you're covered as well. If you process these byte-by-byte, there is a danger you'll find a 0x20 byte that is not a space (e.g., the second byte of a 16-bit UTF-16 value).
For UTF-16, since the extended characters in that encoding are formed from a surrogate pair whose individual values are in the range 0xd800 through 0xdfff, there's no danger that these surrogate pair components could be mistaken for spaces either.
See the Wikipedia article on UTF-16 for more detail.
Finally, UTF-32 is big enough to represent all of the Unicode code points, so no special encoding is required; as with UTF-16, compare whole 32-bit units rather than individual bytes.
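A short sketch of why the byte-wise scan misfires on UTF-16 and UTF-32, and how comparing whole code units avoids it. The function names are illustrative, the input is assumed to be little-endian with no byte order mark, and its length is assumed to be a multiple of two bytes.

```python
import struct

ASCII_WHITESPACE = frozenset(b" \t\n\v\f\r")


def bytewise_whitespace_hits(data: bytes) -> int:
    """Count bytes that look like ASCII whitespace, ignoring the encoding."""
    return sum(byte in ASCII_WHITESPACE for byte in data)


def utf16le_has_ascii_whitespace(data: bytes) -> bool:
    """Compare whole 16-bit code units (assumes little-endian, even length)."""
    units = struct.unpack(f"<{len(data) // 2}H", data)
    return any(unit in ASCII_WHITESPACE for unit in units)


dagger = "\u2020"  # DAGGER: not whitespace, but both of its UTF-16 bytes are 0x20

print(dagger.encode("utf-16-le").hex(" "))                   # 20 20
print(bytewise_whitespace_hits(dagger.encode("utf-16-le")))  # 2 false positives
print(bytewise_whitespace_hits(dagger.encode("utf-32-le")))  # 2 false positives

# ASCII text in UTF-16 carries a zero byte after every character.
print("hi".encode("utf-16-le").hex(" "))                     # 68 00 69 00

print(utf16le_has_ascii_whitespace(dagger.encode("utf-16-le")))  # False
print(utf16le_has_ascii_whitespace("a b".encode("utf-16-le")))   # True
```

Surrogate code units lie in 0xd800 through 0xdfff, so the unit-level comparison cannot confuse them with the small whitespace values either.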
It is strongly suggested not to work on raw bytes when dealing with Unicode. The two major platforms (Java and .NET) support Unicode natively and also provide a mechanism for determining these kinds of things. For example, in Java you can use the Character class's isSpaceChar() or isWhitespace() methods for your use case (isSpace() also exists but is deprecated in favor of isWhitespace()).