Is it possible to get GCC to compile UTF-8 source files with a BOM?

Posted on 2024-12-11 19:58:02

I develop cross-platform C++ using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.

In Visual Studio, I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).

For example:

// A = π.r²
double π = 3.14;

GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:

wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program
wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program

Which brings me to the question:

Is there a way to get GCC to compile UTF-8 files without first removing the BOM?

I'm using:

Windows 7
Visual Studio 2010

and:

Ubuntu 11.10 (Oneiric Ocelot)
GCC 4.6.1, 2011-06-27 (as provided by apt-get install gcc)

As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
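
The byte values in the error messages are consistent with this diagnosis. As a minimal sketch (the helper names octal_escapes and starts_with_utf8_bom are made up for illustration), the following renders bytes in the octal notation GCC uses in its "stray" errors. The UTF-8 encoding of π (U+03C0) is the two bytes CF 80, which come out as \317 and \200, the same values GCC reported, whereas the BOM (EF BB BF) would appear as \357\273\277.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format each byte as a backslash plus three octal
// digits, the notation used in GCC's "stray '\317' in program" errors.
std::string octal_escapes(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        char buf[8];
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}

// Hypothetical helper: true when the buffer starts with the UTF-8 BOM.
bool starts_with_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
}
```

Calling octal_escapes("\xCF\x80") yields \317\200, and starts_with_utf8_bom can be used to check whether a file read into a string actually begins with the BOM.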

誰ツ都不明白 2024-12-18 19:58:02

According to the GCC Wiki, UTF-8 characters in identifiers aren't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCNs (universal character names). From the linked page:

perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'

See also "g++ unicode variable name" and "Unicode identifiers and source code in C++11?".
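
For readers who would rather do the conversion in C++ than in Perl, here is a rough sketch of the same idea. The function name utf8_to_ucn is hypothetical, and the decoder is deliberately simple: it assumes reasonably well-formed UTF-8 input. Like the one-liner, it rewrites every non-ASCII character, including those in comments and string literals, and it does not special-case a leading BOM, which you would probably want to strip first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper mirroring the Perl one-liner: ASCII passes through,
// every other code point becomes a \UXXXXXXXX universal character name.
// Bytes that do not match the UTF-8 lead/continuation pattern are copied
// unchanged; overlong encodings are not rejected.
std::string utf8_to_ucn(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char lead = static_cast<unsigned char>(src[i]);
        if (lead < 0x80) { out += src[i++]; continue; }

        std::size_t extra = 0;   // continuation bytes expected after the lead
        std::uint32_t cp = 0;    // code point being decoded
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }

        bool valid = extra != 0 && i + extra < src.size();
        for (std::size_t k = 1; valid && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) valid = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!valid) { out += src[i++]; continue; }

        char buf[16];
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
        out += buf;
        i += extra + 1;
    }
    return out;
}
```

Applied to the line from the question, "double π = 3.14;" becomes "double \U000003c0 = 3.14;", which is the form intended for compilation with -fextended-identifiers.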

〆一缕阳光ご 2024-12-18 19:58:02

While GCC supports Unicode identifiers, it does not support UTF-8 input. Unicode identifiers therefore have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the preprocessor (libcpp) lets gcc and g++ process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are given in "UTF-8 Identifiers in GCC".

However, the patch is so simple it can be given right here:

diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c

Output:

*** gcc-5.2.0/libcpp/charset.c  Mon Jan  5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c  Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;

Even with the patch, two command-line options (-finput-charset and -fextended-identifiers) are needed to enable UTF-8 input. For example:

/usr/local/gcc-5.2/bin/gcc \
    -finput-charset=UTF-8 -fextended-identifiers \
    -o circle circle.c
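
The contents of circle.c are not shown in the answer, so the following is only a hypothetical illustration, written in C++ rather than C. It spells the identifier π as the universal character name \u03c0, the escaped form described above for an unpatched compiler; with the patch applied, the same name could presumably be typed directly as π in a UTF-8 source file. The function name circle_area is an assumption made for this sketch.

```cpp
#include <cassert>

// Hypothetical example: the local constant is named "π", spelled here as
// a universal character name so that no non-ASCII byte appears in the code.
double circle_area(double radius) {
    assert(radius >= 0.0);  // a negative radius is treated as a caller bug
    const double \u03c0 = 3.14159265358979323846;
    return \u03c0 * radius * radius;
}
```

Older GCC releases such as the 4.6 series from the question need -fextended-identifiers for this to compile.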
