How to count the number of characters in a Unicode string in C

Lets say I have a string:

char theString[] = "你们好āa";

Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the 'a' is one byte):

strlen(theString) == 12

How can I count the number of characters? How can i do the equivalent of subscripting so that:

theString[3] == "好"

How can I slice, and cat such strings?

许久 2024-12-09 17:55:59

You only count the characters whose top two bits are not set to 10 (i.e., everything less than 0x80 or greater than 0xbf).

That's because all the characters with the top two bits set to 10 are UTF-8 continuation bytes.

See the description of the UTF-8 encoding for why this works and how strlen behaves on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point, all others are continuation characters.

Your best bet, if you don't want to use a third-party library, is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

  • the left sz UTF-8 bytes of a string.
  • the sz UTF-8 bytes of a string, starting at pos.
  • the rest of the UTF-8 bytes of a string, starting at pos.

This will be a decent building block to be able to manipulate the strings sufficiently for your purposes.


However, you may need to tighten up your definition of what a character is, and hence how to calculate the size of a string.

If you consider a character to be a Unicode code point, the information above is perfectly adequate.

But you may prefer a different approach. The Annex 29 documentation detailing grapheme cluster boundaries has this snippet:

It is important to recognize that what the user thinks of as a "character" - a basic unit of a writing system for a language - may not be just a single Unicode code point.

One simple example is g̈, which can be thought of as a single character but consists of the two Unicode code points:

  • 0067 (g) LATIN SMALL LETTER G; and
  • 0308 (◌̈ ) COMBINING DIAERESIS.

That would show up as two distinct Unicode characters were you to use the rule "any character not of the binary form 10xxxxxx is the start of a new character".

Annex 29 also calls these grapheme clusters by a more user-friendly name, user-perceived characters. If it's those you wish to count, that annex gives further details.

伤痕我心 2024-12-09 17:55:59

Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{    
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }
    p[len] = '\0';  // NUL-terminate the buffer for the string functions below

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops 
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

Note that your example has an off-by-one error: theString[2] == "好".

—━☆沉默づ 2024-12-09 17:55:59

The easiest way is to use a library like ICU.

獨角戲 2024-12-09 17:55:59

Depending on your notion of "character", this question can get more or less involved.

First off, you should transform your byte string into a string of Unicode code points. You can do this with ICU, though if this is the only thing you need to do, iconv() is a lot easier, and it's part of POSIX.

Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.

However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with an accent ^ can be expressed as two unicode codepoints, or as a combined legacy codepoint â - both are valid, and both are required by the unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.

That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.

Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.

面如桃花 2024-12-09 17:55:59

In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.

Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.

If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).

Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.

温柔嚣张 2024-12-09 17:55:59
// Wrapped as a self-contained function: s is a UTF-8 string,
// and the return value is its number of code points.
size_t utf8_count(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return (j);
}

This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)

However I'm still stumped on slicing and concatenating?!?

初雪 2024-12-09 17:55:59

In general we should use a different data type for unicode characters.

For example, you can use the wide char data type

wchar_t theString[] = L"你们好āa";

Note the L prefix, which indicates that the string is composed of wide characters.

The length of that string can be calculated using the wcslen function, which behaves like strlen.

你的背包 2024-12-09 17:55:59

One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).

This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.

゛时过境迁 2024-12-09 17:55:59

I did a similar implementation years back, but I do not have the code with me.

For each Unicode character, the first byte describes the number of bytes that follow it to construct the character. Based on the first byte you can determine the length of each Unicode character.

I think it's a good UTF-8 library.

热血少△年 2024-12-09 17:55:59

If your program is running in a UTF-8 locale, then the standard mbrlen() function does exactly what you are looking for here.

Note that it will count the number of codepoints, so combining characters such as accents may be counted separately. If that's undesirable, you need a character handling library such as ICU.

陪我终i 2024-12-09 17:55:59

A sequence of code points constitute a single syllable / letter / character in many other Non Western-European languages (eg: all Indic languages)

So, when you are counting the length OR finding a substring (there are definitely use cases for finding substrings - say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.

So the definition of the character/syllable and where you actually break the string into "chunks of syllables" depends upon the nature of the language you are dealing with.
For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following

V  (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V

You need to parse the string and look for the above patterns to break the string and to find the substrings.

I do not think it is possible to have a general-purpose method which can magically break any Unicode string (or sequence of code points) in the above fashion - the pattern that works for one language may not be applicable to another.

I guess there may be some methods / libraries that can take some definition / configuration parameters as input to break Unicode strings into such syllable chunks. Not sure though! I'd appreciate it if someone could share how they solved this problem using any commercially available or open-source methods.
