How can I read a UTF-8 string in plain C89 when given its length in characters?

Posted on 2024-10-29 08:22:30

I'm writing a custom, cross-platform, minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)

The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.

But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: The client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it counts actual UTF-8 characters, not something else.)

I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from FILE *.)

U<CRLF>       ; data type marker (actually read by dispatching code)
<SIZE><CRLF>  ; UTF-8 string size in characters
<DATA><CRLF>  ; data blob

Example:

U
7
Юникод!

Update:

One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.

And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator, because I don't want to mess with escaping it in the data.

Comments (6)

一抹苦笑 2024-11-05 08:22:30

It's pretty easy to write a UTF-8 "reader" given the information here; UTF-8 was designed so tasks like this one would be easy.

In essence, you read characters until you have read as many as the client told you to expect. You know that you've read a whole character from the UTF-8 encoding definition, specifically:

"If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading "10" bit pattern."
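A minimal sketch of that reader in C89 style, assuming the count sent by the client is a number of code points (later answers question this). The names `utf8_seq_len` and `read_utf8_chars` are made up for illustration, and the validation is deliberately shallow: it checks lead and continuation byte patterns but does not reject every overlong or surrogate encoding.

```c
#include <stdio.h>

/* Length in bytes of the sequence introduced by lead byte c,
 * or 0 if c cannot start a sequence (continuation or invalid byte). */
int utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;                /* 0xxxxxxx */
    if (c >= 0xC2 && c <= 0xDF) return 2;  /* 110xxxxx (C0, C1 are overlong) */
    if (c >= 0xE0 && c <= 0xEF) return 3;  /* 1110xxxx */
    if (c >= 0xF0 && c <= 0xF4) return 4;  /* 11110xxx, up to U+10FFFF */
    return 0;
}

/* Read exactly nchars code points from fp into buf (capacity cap bytes).
 * Returns the number of bytes stored, or -1 on EOF, malformed input,
 * or insufficient buffer space. */
long read_utf8_chars(FILE *fp, unsigned long nchars,
                     unsigned char *buf, unsigned long cap)
{
    unsigned long used = 0;

    while (nchars > 0) {
        int c = getc(fp);
        int len, i;

        if (c == EOF) return -1;
        len = utf8_seq_len((unsigned char)c);
        if (len == 0 || (unsigned long)len > cap - used) return -1;
        buf[used++] = (unsigned char)c;

        /* The remaining bytes must all look like 10xxxxxx. */
        for (i = 1; i < len; i++) {
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[used++] = (unsigned char)c;
        }
        nchars--;
    }
    return (long)used;
}
```

Because the lead byte announces the sequence length, the reader never consumes a byte past the last character, so the trailing CRLF and any following datagram stay in the stream.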

海拔太高太耀眼 2024-11-05 08:22:30

Well, the length property of JavaScript strings seems to count codepoints, not characters, as you can see (but wait! it's not quite codepoints):

> s1='\u0061\u0301'
'á'
> s2='\u00E1'
'á'
> s1.length
2
> s2.length
1
>

Although that's with V8. Looking around it seems that that's actually what the ECMAScript standard requires:

https://forums.teradata.com/blog/jasonstrimpel/2011/11/javascript-string-length-and-internationalizing-web-applications

Also, checking ECMA-262, on pages 40-41 of the PDF it says "The length of a String is the number of elements (i.e., 16-bit values) within it", and then goes on to make clear that the elements are UTF-16 units. Sadly that's not quite "codepoints". Basically, this makes the string length property rather useless. Looking around I find this:

How can I tell if a string contains multibyte characters in Javascript?
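If the client really does send a count of UTF-16 code units, the server can still find the byte length, because only 4-byte UTF-8 sequences (code points above U+FFFF) occupy two UTF-16 units. The following is a sketch under that assumption; the function name `utf8_len_for_utf16_units` is hypothetical, and it further assumes the client emits well-formed UTF-8 rather than separately encoded surrogates.

```c
/* Return how many bytes at the start of s (len bytes available) encode
 * exactly `units` UTF-16 code units, or -1 if the input is malformed,
 * too short, or the count would split a surrogate pair. */
long utf8_len_for_utf16_units(const unsigned char *s, unsigned long len,
                              unsigned long units)
{
    unsigned long pos = 0;

    while (units > 0) {
        unsigned long n, cost, i;
        unsigned char c;

        if (pos >= len) return -1;
        c = s[pos];
        if (c < 0x80)                    { n = 1; cost = 1; }
        else if (c >= 0xC2 && c <= 0xDF) { n = 2; cost = 1; }
        else if (c >= 0xE0 && c <= 0xEF) { n = 3; cost = 1; }
        else if (c >= 0xF0 && c <= 0xF4) { n = 4; cost = 2; } /* surrogate pair */
        else return -1;

        if (n > len - pos || cost > units) return -1;
        for (i = 1; i < n; i++)
            if ((s[pos + i] & 0xC0) != 0x80) return -1;

        pos += n;
        units -= cost;
    }
    return (long)pos;
}
```

With the strings from the session above, `s1` ("a" followed by U+0301) is 2 units and 3 bytes, while `s2` (U+00E1) is 1 unit and 2 bytes.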

爱情眠于流年 2024-11-05 08:22:30

Characters? Or codepoints? The two are not the same. Unicode is... complex. You could count all of these different things about a UTF-8 string: length in bytes, length in codepoints, length in characters, length in glyphs, and length in grapheme clusters. All of those might come out different for any given string!

My first inclination is to tell that broken client to go away. But assuming you can't do that, you need to ask what exactly the client is counting. The simplest thing to count, after bytes, is codepoints; that's what UTF-8 encodes, after all. After that? Characters, but you need tables of combining codepoints so that you can identify the sequences of codepoints that make up a character. If the client counts glyphs or grapheme clusters then you're in for a world of hurt. But most likely the client counts either codepoints or characters. If it counts codepoints, then just count the bytes with binary values 0xxxxxxx and 11xxxxxx, that is, every byte that is not a 10xxxxxx continuation byte (though you probably want to implement enough UTF-8 to protect against overlong sequences). If it counts characters, then you need to identify combining marks and count them as part of the associated non-combining codepoint.
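As a rough illustration of the codepoint case: in well-formed UTF-8 every codepoint contributes exactly one byte that is not a 10xxxxxx continuation byte, so counting those bytes gives the codepoint count. The helper names below are invented for this sketch, and the overlong check covers only the first two bytes of a sequence, which is where overlong encodings can be detected.

```c
/* Count codepoints in a buffer assumed to hold well-formed UTF-8:
 * every byte that is not a 10xxxxxx continuation byte starts one. */
unsigned long utf8_count_codepoints(const unsigned char *s, unsigned long len)
{
    unsigned long i, n = 0;

    for (i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* Return 1 if lead byte b0 followed by b1 begins an overlong encoding,
 * i.e. a value that would fit in a shorter sequence; otherwise 0. */
int utf8_is_overlong(unsigned char b0, unsigned char b1)
{
    if (b0 == 0xC0 || b0 == 0xC1) return 1;  /* 2 bytes for a value < 0x80 */
    if (b0 == 0xE0 && b1 < 0xA0) return 1;   /* 3 bytes for a value < 0x800 */
    if (b0 == 0xF0 && b1 < 0x90) return 1;   /* 4 bytes for a value < 0x10000 */
    return 0;
}
```

Counting this way does not handle the "characters" case; combining marks would still need a Unicode property table, which this sketch does not attempt.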

枕梦 2024-11-05 08:22:30

If the length you get doesn't match the number of bytes you get, you have a couple of choices.

  1. Read one byte at a time and assemble the bytes into characters until you have the matching number of characters.

  2. Add a known terminator and skip the string size entirely. Just read one byte at a time until you read the terminator sequence.

  3. Read the number of bytes listed in the header (since that's the minimum number). Figure out if you have enough characters. If not, read some more!
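Option 3 can be sketched as follows, assuming the header counts code points. Each remaining character needs at least one byte, so reading that many bytes at once can never run past the end of the string; when a character turns out to be longer, only its missing tail is fetched. The names `utf8_seq_len` and `read_utf8_bulk` are hypothetical.

```c
#include <stdio.h>

/* Sequence length implied by lead byte c, or 0 if c is not a valid lead. */
int utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) return 2;
    if (c >= 0xE0 && c <= 0xEF) return 3;
    if (c >= 0xF0 && c <= 0xF4) return 4;
    return 0;
}

/* Read exactly nchars code points from fp into buf (capacity cap bytes)
 * using bulk reads. Returns bytes stored, or -1 on error. */
long read_utf8_bulk(FILE *fp, unsigned long nchars,
                    unsigned char *buf, unsigned long cap)
{
    unsigned long used = 0;  /* bytes read so far */
    unsigned long pos = 0;   /* bytes already parsed into characters */

    while (nchars > 0) {
        unsigned long want, len, i;

        if (pos == used) {
            /* Lower bound: one byte per character still expected. */
            want = nchars;
            if (want > cap - used) return -1;
            if (fread(buf + used, 1, want, fp) != want) return -1;
            used += want;
        }
        len = (unsigned long)utf8_seq_len(buf[pos]);
        if (len == 0) return -1;
        if (pos + len > used) {
            /* Character is cut off: fetch only its missing tail. */
            want = pos + len - used;
            if (want > cap - used) return -1;
            if (fread(buf + used, 1, want, fp) != want) return -1;
            used += want;
        }
        for (i = 1; i < len; i++)
            if ((buf[pos + i] & 0xC0) != 0x80) return -1;
        pos += len;
        nchars--;
    }
    return (long)pos;
}
```

Compared with reading byte by byte, this issues far fewer read calls for mostly-ASCII data, at the cost of slightly more bookkeeping.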

っ〆星空下的拥抱 2024-11-05 08:22:30

If the DATA can't contain a CRLF, it seems that you could use the CRLF as a framing delimiter. Just ignore the SIZE and read until CRLF.

橘寄 2024-11-05 08:22:30

This looks like exactly the thing I need. I wish I had found it earlier:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
