Accented characters to unaccented characters in C

Posted on 2024-09-19 00:31:42

Hey guys. Simple question: how do I remove accents from a char? For example, ã -> a and é -> e. I asked in another question how to convert UTF-8 to ASCII, but that turned out to be unnecessary, since I only need to handle these cases.

I tried:

char comando;
if( comando == 'ç' || comando == 'Ç') {
        comando = 'c';
        return comando;
    }

But it gives me this error: "comparison is always false due to limited range of data type".

I can't be certain which version of GCC my teacher will use to compile my program, but she will run it on Linux (probably Ubuntu). And I can't use the standard library. :(

Thanks!

Comments (3)

溺孤伤于心 2024-09-26 00:31:42

To supplement the other answers, try this for size:

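#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the environment's locale (e.g. a UTF-8 one) so that
       mbstowcs() knows how to decode the input bytes. */
    setlocale(LC_ALL, "");

    wchar_t* x = calloc(100, sizeof(wchar_t));
    char*    y = calloc(100, sizeof(char));

    if ( x == NULL || y == NULL )
    {
        free(y); free(x);
        return 1;
    }

    printf("Input something: ");
    fread(y, 1, 99, stdin);   /* at most 99 bytes, so y stays NUL-terminated */

    /* Convert at most 99 wide characters, so x stays NUL-terminated too. */
    if ( mbstowcs(x, y, 99) != (size_t)-1 && x[0] == L'è' )   /* ==, not = */
    {
        printf("Ohhh, french character!\n");
    }

    free(y); free(x);

    return 0;
}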

This code shows you two things: firstly, how to convert a multi-byte string you have read in into a wide character string. From there, you can handle nearly every character that exists (theoretically at least).

Having done this, you simply need a map from each character to its replacement, which lets you walk the wide string and map each character to something else. See the other answers for this.
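As a rough illustration of such a map, here is a minimal sketch. The names `accent_pair`, `accent_map` and `unaccent_wide` are made up for this example, the table lists only a handful of characters, and it assumes wchar_t holds Unicode code points (as it does with glibc on Linux). Code points are written in hex so that the encoding of the source file does not matter.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper type: one entry per accented character we care about. */
struct accent_pair {
    wchar_t accented;
    char    plain;
};

/* Assumption: wchar_t values are Unicode code points. */
static const struct accent_pair accent_map[] = {
    { 0x00C7, 'C' },  /* Ç */
    { 0x00E3, 'a' },  /* ã */
    { 0x00E7, 'c' },  /* ç */
    { 0x00E8, 'e' },  /* è */
    { 0x00E9, 'e' },  /* é */
};

/* Return the unaccented form of wc, or wc itself if it is not in the map. */
wchar_t unaccent_wide(wchar_t wc)
{
    size_t n = sizeof accent_map / sizeof accent_map[0];
    size_t i;

    for (i = 0; i < n; i++) {
        if (accent_map[i].accented == wc)
            return (wchar_t)accent_map[i].plain;
    }
    return wc;
}
```

After the mbstowcs() call above, you could apply unaccent_wide() to each element of x in a loop. A linear search is fine for a few dozen entries; a larger map would probably want a sorted table or a switch.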

Some notes: I've deliberately used fread() on stdin with a fixed byte limit; press Ctrl+D when you have finished typing. This prevents the buffer overflow that an unbounded scanf("%s") would expose you to if you passed the result on to another function (see NOP sled).

Secondly, I have blindly assumed that y's input will be mostly single-byte characters. In fact, if the multi-byte string uses two bytes per character, 100 chars hold only 50 characters, so only 50 wchar_t are needed. I could check lengths and so on too, but that's beyond the scope of this example.

半衾梦 2024-09-26 00:31:42

The C standard says that character constants such as 'ç' are integer constants:

§6.4.4.4/9

An integer character constant has type int. The value of an integer character constant
containing a single character that maps to a single-byte execution character is the
numerical value of the representation of the mapped character interpreted as an integer.

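If the char type is signed on your machine (it is with GCC on x86 Linux), then when comando holds a byte above 0x7F and is promoted to int, it becomes a negative integer, whereas the constant 'ç' as the compiler sees it here is a positive integer outside the range of char. (In a UTF-8 source file 'ç' is two bytes, which GCC treats as a multi-character constant.) The comparison can therefore never be true, hence the warning from the compiler.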
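One way to sidestep the signedness problem is to compare byte values rather than character constants. The sketch below assumes the input is in a single-byte encoding such as ISO 8859-1, where 'ç' is 0xE7 and 'Ç' is 0xC7; the function name `is_c_cedilla` is invented for this example. It does not apply to UTF-8 input, where the character occupies two bytes and cannot be held in one char.

```c
/* Returns 1 if ch is the ISO 8859-1 byte for 'ç' (0xE7) or 'Ç' (0xC7),
   0 otherwise. Assumes a single-byte encoding. */
int is_c_cedilla(char ch)
{
    /* Converting to unsigned char gives 0..255 whether or not
       plain char is signed on this platform. */
    unsigned char u = (unsigned char)ch;

    return u == 0xE7 || u == 0xC7;
}
```

Writing the codes in hex also keeps the source file pure ASCII, so the result does not depend on the encoding your editor happens to save in.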


For an 8-bit character set, by far the fastest way to do such an operation is to create a table of 256 bytes, where each position contains the unaccented version of the character.

int unaccented(int c)
{
     static const char map[256] =
     {
          '\x00', '\x01', ...
          ...
          '0',    '1',    '2', ...
          ...
          'A',    'B',    'C', ...
          ...
          'a',    'b',    'c', ...
          ...
          'A',    'A',    'A', ... // 0xC0 onwards...
          ...
          'a',    'a',    'a', ... // 0xE0 onwards...
          ...
     };
     if (c < 0 || c > 255)
         return EOF;
     else
         return map[c];
}

Of course, you'd write a program - probably a script - to generate the table of data, rather than doing it manually. In the range 0..127, the character at position x is the character with code x (so map['A'] == 'A').
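For ISO 8859-1 specifically, the table can be shortened, because every code below 0xC0 maps to itself. The following is a compact variant of the idea rather than the full 256-entry table: it stores only the 64 entries for 0xC0..0xFF. The names `upper_map`, `unaccented_latin1` and `unaccent_latin1_string` are invented here, it returns -1 instead of EOF so that no header is needed, and the choice to leave Æ, Ð, ×, Ø, Þ, ß (and æ, ð, ÷, ø, þ) unchanged is an assumption.

```c
/* Replacement for each ISO 8859-1 code 0xC0..0xFF.
   '.' means "leave this character unchanged". */
static const char upper_map[] =
    "AAAAAA.CEEEEIIII"    /* 0xC0..0xCF */
    ".NOOOOO..UUUUY.."    /* 0xD0..0xDF */
    "aaaaaa.ceeeeiiii"    /* 0xE0..0xEF */
    ".nooooo..uuuuy.y";   /* 0xF0..0xFF */

/* c must be in 0..255; anything else yields -1. */
int unaccented_latin1(int c)
{
    char r;

    if (c < 0 || c > 255)
        return -1;
    if (c < 0xC0)
        return c;              /* ASCII, controls and symbols map to themselves */

    r = upper_map[c - 0xC0];
    return r == '.' ? c : r;
}

/* Strip accents from a NUL-terminated ISO 8859-1 string in place. */
void unaccent_latin1_string(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)unaccented_latin1((unsigned char)*s);   /* cast avoids negative values */
}
```

This uses no library functions at all, which may matter if the standard library is off limits.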

If you are allowed to exploit C99, you can improve the table by using designated initializers:

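static const char map[] =
{
    // Indices above 0x7F are written in hex: with a signed char,
    // 'å' would be a negative (invalid) array index.
    ['\x00'] = '\x00', ...
    ['A']    = 'A', ...
    ['a']    = 'a', ...
    [0xE5]   = 'a', ...   // 'å' in ISO 8859-1
    [0xC5]   = 'A', ...   // 'Å'
    [0xFF]   = 'y', ...   // 'ÿ'
};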
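One thing to watch with designated initializers is that any entry you do not list is initialized to zero, not to its own index. A sparse table therefore needs a lookup that treats zero as "no change". Here is a minimal sketch under that assumption; the names `accent_override` and `unaccented_sparse` are invented, only a few ISO 8859-1 codes are listed, and it returns -1 rather than EOF to avoid needing a header.

```c
/* Only the characters that change are listed; every other entry is 0. */
static const char accent_override[256] = {
    [0xC5] = 'A',   /* Å */
    [0xC7] = 'C',   /* Ç */
    [0xE3] = 'a',   /* ã */
    [0xE5] = 'a',   /* å */
    [0xE7] = 'c',   /* ç */
    [0xE9] = 'e',   /* é */
    [0xFF] = 'y',   /* ÿ */
};

/* c must be in 0..255; anything else yields -1. */
int unaccented_sparse(int c)
{
    if (c < 0 || c > 255)
        return -1;

    /* A zero entry means "no override": return the input unchanged. */
    return accent_override[c] != 0 ? accent_override[c] : c;
}
```

The extra test costs little and saves writing out all 256 identity entries.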

It isn't entirely clear what you should do with letters such as 'æ' or 'ß' that have no single-character ASCII equivalent; however, the simple rule of 'when in doubt, do not change it' can be applied sensibly. They aren't accented characters, but neither are they ASCII characters.

This does not work so well for UTF-8. For that, you need more specialized tables driven from data in the Unicode standard.
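That said, if only the Latin-1 range matters (which covers ã, é and ç), a limited UTF-8 version is possible without full Unicode tables, because U+00C0..U+00FF are always encoded as the byte 0xC3 followed by a byte in 0x80..0xBF. The sketch below relies on that fact; the names `latin1_sup_map` and `strip_accents_utf8` are invented, and it handles precomposed characters in that range only (not combining marks or anything above U+00FF).

```c
/* Replacement for U+00C0..U+00FF; '.' means "leave unchanged". */
static const char latin1_sup_map[] =
    "AAAAAA.CEEEEIIII"    /* U+00C0..U+00CF */
    ".NOOOOO..UUUUY.."    /* U+00D0..U+00DF */
    "aaaaaa.ceeeeiiii"    /* U+00E0..U+00EF */
    ".nooooo..uuuuy.y";   /* U+00F0..U+00FF */

/* Rewrite the NUL-terminated UTF-8 string s in place, replacing accented
   letters in U+00C0..U+00FF with their base letter.
   Returns the new length in bytes. */
unsigned long strip_accents_utf8(char *s)
{
    unsigned char *in  = (unsigned char *)s;
    unsigned char *out = (unsigned char *)s;

    while (*in != 0) {
        /* 0xC3 followed by 0x80..0xBF encodes U+00C0..U+00FF. */
        if (in[0] == 0xC3 && in[1] >= 0x80 && in[1] <= 0xBF
            && latin1_sup_map[in[1] - 0x80] != '.') {
            *out++ = (unsigned char)latin1_sup_map[in[1] - 0x80];
            in += 2;                 /* two input bytes become one */
        } else {
            *out++ = *in++;          /* copy every other byte untouched */
        }
    }
    *out = 0;
    return (unsigned long)(out - (unsigned char *)s);
}
```

The output is never longer than the input, which is why the rewrite can safely be done in place.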

Also note that you should convert any 'char' value to 'unsigned char' before calling this. The code could also try to cope with callers who pass a plain (possibly negative) char; however, it is then hard to distinguish 'ÿ' (0xFF) from EOF. The C standard character test macros in <ctype.h> are required to support all valid character values (when converted to unsigned char) and EOF as inputs, and this function follows that design.

§7.4/1

In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

缘字诀 2024-09-26 00:31:42

You mentioned in another similar question that this was easy enough to do in other languages that you know. If I were you and couldn't find a good way to do this with available code in C, but needed to do so in C, I would write a program in another language to generate a C function that does the conversion for you. As long as you can cycle through all characters this shouldn't be too difficult, though the generated code may be large. I'd probably do this for UTF-16, and just have a simple wrapper function that takes UTF-8, converts it to UTF-16, and calls the conversion function.

The conversion function would just be made of a very large switch/case statement, and the default case would be for characters that didn't convert.
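To give an idea of the shape such generated code might take, here is a hand-written fragment. The function name `unaccent_utf16` is hypothetical and only a few cases are shown; a real generator would emit one case per character it knows how to convert.

```c
/* Hypothetical shape of the generated function: takes one UTF-16 code
   unit and returns its unaccented form. Only a few cases are shown. */
unsigned short unaccent_utf16(unsigned short u)
{
    switch (u) {
    case 0x00C0: case 0x00C1: case 0x00C2: case 0x00C3:
        return 'A';              /* À Á Â Ã */
    case 0x00C7:
        return 'C';              /* Ç */
    case 0x00E0: case 0x00E1: case 0x00E2: case 0x00E3:
        return 'a';              /* à á â ã */
    case 0x00E7:
        return 'c';              /* ç */
    case 0x00E8: case 0x00E9: case 0x00EA:
        return 'e';              /* è é ê */
    default:
        return u;                /* characters that don't convert stay as they are */
    }
}
```

Compilers typically turn a dense switch like this into a jump table or a range check, so the size of the source need not translate into slow code.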
