UTF-8 string delimiter
I am parsing a binary protocol that has UTF-8 strings interspersed among raw bytes. This protocol prefixes each UTF-8 string with a short (two bytes) giving the length of the UTF-8 string that follows. If the short is read as unsigned, that allows a maximum string length of 2^16 - 1 = 65,535, which is more than adequate for this application.
My question: is this a standard way of delimiting UTF-8 strings?
Comments (3)
I wouldn't call that delimiting; it's more like "length prefixing". Some people call these Pascal strings, because in the early days Pascal was one of the popular languages that stored strings that way in memory.
I don't think there's a formal standard for just that, as it's a rather obvious way of storing UTF-8 strings (or any strings of bytes, for that matter). It is defined over and over as part of many standards that deal with messages containing strings, though.
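To make the length-prefixing idea concrete, here is a minimal sketch in Python. It assumes the two-byte prefix is an unsigned big-endian integer counting bytes (not characters); the question states neither the byte order nor the unit, so check both against the actual protocol. The function names are made up for illustration.

```python
import struct
from typing import Tuple


def encode_prefixed(text: str) -> bytes:
    """Encode text as UTF-8 preceded by a 2-byte length (assumed big-endian, unsigned)."""
    data = text.encode("utf-8")
    if len(data) > 0xFFFF:
        raise ValueError("string too long for a 16-bit length prefix")
    return struct.pack(">H", len(data)) + data


def decode_prefixed(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Read one length-prefixed string at offset; return (text, offset just past it)."""
    if offset + 2 > len(buf):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated string body")
    return buf[start:end].decode("utf-8"), end
```

The decoder returns the offset just past the string, so the caller can carry on parsing the raw bytes that follow it.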
UTF-8 is not normally delimited. You should be able to spot the multibyte characters in the stream by using the rules described here: http://en.wikipedia.org/wiki/UTF-8#Description
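The bit-pattern rules on that page can be sketched roughly as follows; the function names are made up for illustration. Note that these rules find character boundaries. On their own they may not tell a parser where a string ends if the surrounding raw bytes can also look like valid UTF-8.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged from its first byte."""
    if lead < 0x80:            # 0xxxxxxx: single-byte (ASCII)
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two-byte sequence (0xC0/0xC1 are never valid)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four-byte sequence (up to U+10FFFF)
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80
```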
I would use a delimiter that starts with 0x11...
But if you send raw bytes, you will have to keep this delimiter out of the data/messages being processed. That means that if user input happens to contain something that looks like the delimiter, you will have to convert it.
If the user inputs any character represented in UTF-8, you can simply send it as is.
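One common way to do that conversion is byte stuffing. The sketch below assumes a single-byte delimiter 0x11, and the escape byte 0x1B and the XOR-with-0x20 transform are arbitrary choices made for illustration, not part of any protocol mentioned here.

```python
DELIM = 0x11  # delimiter byte suggested above
ESC = 0x1B    # hypothetical escape byte, chosen only for this sketch


def stuff(payload: bytes) -> bytes:
    """Escape DELIM/ESC inside payload, then terminate the frame with DELIM."""
    out = bytearray()
    for b in payload:
        if b in (DELIM, ESC):
            out.append(ESC)
            out.append(b ^ 0x20)  # transformed value is never DELIM or ESC
        else:
            out.append(b)
    out.append(DELIM)
    return bytes(out)


def unstuff(frame: bytes) -> bytes:
    """Reverse stuff(): drop the trailing DELIM and undo the escapes."""
    if not frame or frame[-1] != DELIM:
        raise ValueError("frame does not end with the delimiter")
    out = bytearray()
    it = iter(frame[:-1])
    for b in it:
        if b == ESC:
            try:
                out.append(next(it) ^ 0x20)
            except StopIteration:
                raise ValueError("dangling escape byte") from None
        else:
            out.append(b)
    return bytes(out)
```

With this scheme the delimiter byte never appears inside a stuffed frame body, so a reader can scan for 0x11 to find the end of each message. The cost is that payloads containing 0x11 or 0x1B grow by one byte per occurrence.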