How to read a UTF-8 string given its length in characters in plain C89?

Posted 2024-10-29 08:22:30

I'm writing a custom cross-platform minimalistic TCP server in plain C89. (But I will also accept a POSIX-specific answer.)

The server works with UTF-8 strings, but never looks inside them. It treats all strings as immutable binary blobs.

But now I need to accept UTF-8 strings from a client that does not know how to calculate their size in bytes. The client can only transmit the string length in characters. (Update: The client is in JavaScript, and "length in characters" is, in fact, whatever String.length returns. I assume it is actual UTF-8 characters, not something else.)

I do not want to add heavy dependencies to my tiny server. Is there a robust and neat way to read this datagram? (For the sake of this question, let's say that it is read from FILE *.)

U<CRLF>       ; data type marker (actually read by dispatching code)
<SIZE><CRLF>  ; UTF-8 string size in characters
<DATA><CRLF>  ; data blob

Example:

U
7
Юникод!

Update:

One batch of data can contain more than one datagram, so approximate reads would not work; I need to read the exact number of characters.

And the actual UTF-8 data may contain any characters, so I can't pick a character as a terminator. I don't want to mess with escaping it in the data.

一抹苦笑 2024-11-05 08:22:30

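Writing a UTF-8 "reader" is quite easy based on the information here; UTF-8 was designed to make tasks like this easy.

Essentially, you keep reading characters until you have read as many as the client told you to. You know that you have read a whole character from the definition of the UTF-8 encoding, specifically:

"If the character is encoded by just one byte, the high-order bit is 0 and the other bits give the code value (in the range 0..127). If the character is encoded by a sequence of more than one byte, the first byte has as many leading '1' bits as the total number of bytes in the sequence, followed by a '0' bit, and the succeeding bytes are all marked by a leading '10' bit pattern."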
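The rule quoted above can be sketched in C89 roughly as follows. This is only a minimal sketch, assuming that "characters" means Unicode code points (which may not match what a JavaScript client actually sends) and that the caller supplies a buffer; the names utf8_seq_len and read_utf8_chars are made up for illustration. It checks the lead-byte and continuation-byte patterns but does not reject overlong encodings.

```c
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence introduced by lead byte c,
 * or 0 if c is a continuation byte or not a valid lead byte. */
static int utf8_seq_len(int c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Reads exactly nchars code points from fp into buf (capacity cap bytes).
 * Returns the number of bytes stored, or -1 on EOF, malformed input,
 * or insufficient buffer space. Hypothetical helper, not a library API. */
static long read_utf8_chars(FILE *fp, unsigned long nchars,
                            unsigned char *buf, unsigned long cap)
{
    unsigned long used = 0;

    while (nchars > 0) {
        int c = getc(fp);
        int len, i;

        if (c == EOF) return -1;
        len = utf8_seq_len(c);
        if (len == 0 || used + (unsigned long)len > cap) return -1;
        buf[used++] = (unsigned char)c;

        /* Every following byte of the sequence must look like 10xxxxxx. */
        for (i = 1; i < len; i++) {
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[used++] = (unsigned char)c;
        }
        nchars--;
    }
    return (long)used;
}
```

For the example datagram in the question, asking for 7 characters consumes the 13 bytes of the string and leaves the stream positioned at the trailing CRLF, so the next datagram can be read normally.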

海拔太高太耀眼 2024-11-05 08:22:30

Well, the length property of JavaScript strings seems to count codepoints, not characters, as you can see (but wait! it's not quite codepoints):

> s1='\u0061\u0301'
'á'
> s2='\u00E1'
'á'
> s1.length
2
> s2.length
1
>

Although that's with V8. Looking around it seems that that's actually what the ECMAScript standard requires:

https://forums.teradata.com/blog/jasonstrimpel/2011/11/javascript-string-length-and-internationalizing-web-applications

Also, checking ECMA-262, on pages 40-41 of the PDF it says "The length of a String is the number of elements (i.e., 16-bit values) within it", and then goes on to make clear that the elements are UTF-16 code units. Sadly that's not quite "codepoints". Basically, this makes the string length property rather useless. Looking around I find this:

How can I tell if a string contains multibyte characters in Javascript?
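If the client really does send the JavaScript length (a count of UTF-16 code units), the server can still find the end of the string, because the number of UTF-16 code units for each code point follows from the UTF-8 lead byte: sequences of 1 to 3 bytes encode BMP code points (one unit), and 4-byte sequences encode supplementary code points (two units, a surrogate pair). Below is a minimal C89 sketch of that idea. The function name is hypothetical, and it assumes the client's string is well-formed UTF-16 (no lone surrogates) and was encoded as standard UTF-8.

```c
#include <stdio.h>

/* Reads UTF-8 from fp until the text read so far corresponds to exactly
 * `units` UTF-16 code units. Returns the number of bytes stored in buf,
 * or -1 on EOF, malformed input, insufficient space, or a count that
 * would end in the middle of a surrogate pair. Hypothetical helper. */
static long read_utf8_by_utf16_units(FILE *fp, unsigned long units,
                                     unsigned char *buf, unsigned long cap)
{
    unsigned long used = 0;

    while (units > 0) {
        int c = getc(fp);
        int len, i;
        unsigned long cost;

        if (c == EOF) return -1;
        if ((c & 0x80) == 0x00) len = 1;
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return -1;

        /* A 4-byte sequence is a supplementary code point: 2 UTF-16 units. */
        cost = (len == 4) ? 2UL : 1UL;
        if (cost > units || used + (unsigned long)len > cap) return -1;

        buf[used++] = (unsigned char)c;
        for (i = 1; i < len; i++) {
            c = getc(fp);
            if (c == EOF || (c & 0xC0) != 0x80) return -1;
            buf[used++] = (unsigned char)c;
        }
        units -= cost;
    }
    return (long)used;
}
```

With the s1 string from the session above ('\u0061\u0301', length 2), a request for 2 units reads the 3 bytes 61 CC 81.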

爱情眠于流年 2024-11-05 08:22:30

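Characters? Or code points? The two are not the same. Unicode is... complex. You can count all of these different things about a UTF-8 string: its length in bytes, in code points, in characters, in glyphs, and in grapheme clusters. All of those may come out different for any given string!

My first reaction is to tell that broken client to go away. But assuming you can't do that, you need to ask what exactly the client is counting. The simplest count, after bytes, is the count of code points; that is what UTF-8 encodes, after all. After that? Characters, but then you need tables of combining code points so that you can identify the sequences of code points that make up a character. If the client counts glyphs or grapheme clusters, you are in for a world of pain. But most likely the client counts either code points or characters. If it counts code points, just count the bytes with binary values 0xxxxxxx and 11xxxxxx, since each of those starts a new code point while bytes of the form 10xxxxxx are continuation bytes (though you will probably want to implement enough of UTF-8 to guard against overlong sequences). If it counts characters, you need to identify combining marks and count them as part of the associated non-combining code point.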
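To illustrate the code point case together with the guard against overlong sequences, here is a minimal C89 sketch that works on an in-memory buffer. The names utf8_decode and utf8_count are made up for this example; it rejects overlong forms, surrogates, and values above U+10FFFF, and it deliberately counts code points only, so a base letter followed by a combining mark counts as two.

```c
#include <stddef.h>

/* Decodes one code point from s[0..n-1]. On success stores it in *cp and
 * returns the number of bytes consumed (1..4). Returns 0 if the input is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF. */
static int utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const unsigned long min_cp[5] = { 0UL, 0UL, 0x80UL, 0x800UL, 0x10000UL };
    unsigned long v;
    int len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                  /* continuation or invalid lead byte */

    if ((size_t)len > n) return 0;  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (unsigned long)(s[i] & 0x3F);
    }
    if (v < min_cp[len]) return 0;                /* overlong encoding */
    if (v >= 0xD800UL && v <= 0xDFFFUL) return 0; /* surrogate */
    if (v > 0x10FFFFUL) return 0;                 /* outside Unicode */
    *cp = v;
    return len;
}

/* Counts the code points in s[0..n-1], or returns -1 if it is not valid UTF-8. */
static long utf8_count(const unsigned char *s, size_t n)
{
    long count = 0;
    size_t pos = 0;

    while (pos < n) {
        unsigned long cp;
        int len = utf8_decode(s + pos, n - pos, &cp);
        if (len == 0) return -1;
        pos += (size_t)len;
        count++;
    }
    return count;
}
```

Counting characters or grapheme clusters instead would additionally require Unicode property tables, which this sketch does not attempt.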
