What exactly does "U+" stand for, and why can't I create a table of Unicode intermediate strings in my C++ application?

Posted on 2024-10-04 10:50:53

I'm trying to convert an application from Java + Swing to C++ + Qt. At one point I had to deal with some Unicode intermediates. In Java, this was fairly easy:

private static String[] hiraganaTable = {
    "\u3042", "\u3044", "\u3046", "\u3048", "\u304a", 
    "\u304b", "\u304d", "\u304f", "\u3051", "\u3053", 
    ...
}

...whereas in C++ I'm having problems:

QString hiraganaTable[] = {
    "\x30\x42", "\x30\x44", "\x30\x46", "\x30\x48", "\x30\x4a", 
    "\x30\x4b", "\x30\x4d", "\x30\x4f", "\x30\x51", "\x30\x53", 
    ...
};

I couldn't use \u in VS2008 because I got a heap of warnings of the form:

character represented by universal-character-name '\u3042' cannot be represented in the current code page (1250)

And don't call me stupid, I tried to use File->Advanced Save Options to no avail, the codepage didn't seem to change at all. Seems like this is a known problem: How to create a UTF-8 string literal in Visual C++ 2008

The table I'm using is fairly short, so with the help of Vim and some introductory-level regexp magic, I was able to convert it to \x30\x42 notation. Unfortunately, the QStrings would not initialize properly from such input. I tried everything: fromAscii(), fromUtf8(), fromLocal8Bit(), QString(QByteArray), the works. Then I tried writing U+3042 to a file without a BOM and viewing it in hex mode, and found that it actually comes out as "E3 81 82". Suddenly, an entry written that way seemed to work with QString::fromAscii(). Now I'm left wondering what exactly the "U+" in "U+3042" stands for (since 0xE38182 - 0x3042 = 0xE35140, maybe I'd better add this Magic Constant to all my would-be Unicode chars?). How should I proceed from here to get an array of proper UTF-8 strings?

公布 2024-10-11 10:50:53

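The problem is that C++ is based on C, which dates back to the ASCII era. The "default" C string literal, "abc", is an array of 8-bit chars. Your Visual C++ compiler does have 16-bit Unicode (UTF-16) literals, though, with a slightly different syntax: L"abc\u3042". The type of such a literal is const wchar_t[N] rather than const char[N], and you can store it in a std::wstring.

Qt understands wchar_t and can build a QString from such strings without conversion problems, for example via QString::fromWCharArray() or QString::fromStdWString().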
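
As a rough illustration of the wide-literal approach, here is a minimal sketch in plain C++ (no Qt, so it compiles on its own). The names kHiraganaWide and hiraganaAt are made up for this example, and only the first row of the question's table is shown.

```cpp
#include <cstddef>
#include <string>

// Hypothetical table name. Inside a wide literal, \uXXXX names the code
// point directly, so no narrow code page is involved.
static const wchar_t* const kHiraganaWide[] = {
    L"\u3042", L"\u3044", L"\u3046", L"\u3048", L"\u304a",
};

// Hypothetical helper: copies one table entry into a std::wstring.
// The caller must pass an index that is inside the table.
std::wstring hiraganaAt(std::size_t i) {
    return std::wstring(kHiraganaWide[i]);
}
```

In a Qt project, an entry could then presumably be handed over with QString::fromWCharArray(kHiraganaWide[i]) or QString::fromStdWString(hiraganaAt(i)).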

对你再特殊 2024-10-11 10:50:53

What you're seeing is the UTF-8 encoding of that character.

>>> u'\u3042'.encode('utf-8').encode('hex')
'e38182'

If you write them all out in UTF-8 then you should be fine.

The "U+" just indicates that you're looking at a Unicode codepoint as opposed to some specific encoding.
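
To make the relationship between a code point and its UTF-8 bytes concrete, here is a minimal sketch of a UTF-8 encoder in plain C++. The function name encodeUtf8 is an assumption for this example, not something from Qt or the standard library. It also shows why a single "magic constant" cannot work: U+3042 encodes to E3 81 82, but U+3080 encodes to E3 82 80, so the numeric difference between the bytes and the code point changes from one character to another.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
    if (cp > 0x10FFFF) return out;                 // outside the Unicode range
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (hiragana land here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The bits of the code point are spread across the payload positions of each byte, which is why the encoding is a bit-shuffling operation rather than an addition.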

EDIT:

A small scriptlet to help you get started, in Python 2 (same language as above):

>>> print ',\n'.join(', '.join('"%s"' % (y.encode('utf-8').encode('string-escape')
      ,) for y in x) for x in [u'あいうえお', u'かきくけこ', u'さしすせそ'])
"\xe3\x81\x82", "\xe3\x81\x84", "\xe3\x81\x86", "\xe3\x81\x88", "\xe3\x81\x8a",
"\xe3\x81\x8b", "\xe3\x81\x8d", "\xe3\x81\x8f", "\xe3\x81\x91", "\xe3\x81\x93",
"\xe3\x81\x95", "\xe3\x81\x97", "\xe3\x81\x99", "\xe3\x81\x9b", "\xe3\x81\x9d"

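The escaped output above can be pasted into C++ as ordinary narrow string literals. Here is a minimal sketch in plain C++; kHiraganaUtf8 and countCodePoints are names invented for this example, and only the first row is shown. Each entry occupies three bytes but represents a single character, which the helper checks by skipping UTF-8 continuation bytes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical table name: UTF-8 bytes written as hex escapes, so the
// source file's code page does not matter.
static const char* const kHiraganaUtf8[] = {
    "\xe3\x81\x82", "\xe3\x81\x84", "\xe3\x81\x86", "\xe3\x81\x88", "\xe3\x81\x8a",
};

// Hypothetical helper: count code points in a NUL-terminated UTF-8 string
// by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

In a Qt project, such an entry would presumably be converted with QString::fromUtf8(kHiraganaUtf8[i]), since the bytes are UTF-8 rather than ASCII or the local 8-bit encoding.
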