Restoring Unicode strings at runtime

Posted on 2024-12-16 22:36:26

I'm building an application that receives strings at runtime over TCP, with Unicode characters encoded as escape sequences; an example string would be "\u7cfb\u8eca\u4e21\uff1a\u6771\u5317 ...". I have the following, but it does not compile: the compiler reports "incomplete universal character name \u", because \u is resolved at compile time and must be followed by 4 hexadecimal characters.

QString restoreUnicode(QString strText)
   {
      QRegExp rx("\\\\u([0-9a-z]){4}");
      return strText.replace(rx, QString::fromUtf8("\u\\1"));
   }

I'm seeking a solution that works at runtime. I can foresee breaking up these strings, converting the hexadecimal digits after each "\u" delimiter into an integer, and passing that to the QChar constructor. However, I'm looking for a better way if one exists, as I am concerned about the time complexity of such a method and am not an expert.

Does anyone have any solutions or tips?

挽清梦 2024-12-23 22:36:26

You should decode the string yourself. Find each Unicode escape (rx.indexIn(strText)), parse it (int result; std::istringstream iss(s); if (!(iss >> std::hex >> result).fail()) ...) and replace the original \\uXXXX sequence with (wchar_t)result.
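
The manual decoding described above could look roughly like the sketch below. It is written in standard C++ rather than Qt, so std::regex stands in for QRegExp and the result is a std::wstring. The function name decodeEscapes is hypothetical, chosen only for illustration, and any text outside the escapes is assumed to be plain ASCII.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not from the thread): replaces every \uXXXX escape in
// `text` with the corresponding wchar_t. Characters outside the escapes are
// widened unchanged, so they are assumed to be plain ASCII.
std::wstring decodeEscapes(const std::string& text)
{
    // A literal backslash, 'u', then exactly four hex digits (captured).
    static const std::regex rx(R"(\\u([0-9a-fA-F]{4}))");

    std::wstring out;
    auto last = text.cbegin();
    for (std::sregex_iterator it(text.cbegin(), text.cend(), rx), end; it != end; ++it)
    {
        const std::smatch& m = *it;

        // Copy the literal text that precedes this escape.
        out.append(last, m[0].first);

        // Parse the four captured hex digits, as the answer suggests.
        int result = 0;
        std::istringstream iss(m[1].str());
        if (!(iss >> std::hex >> result).fail())
            out += static_cast<wchar_t>(result);
        else
            out.append(m[0].first, m[0].second); // keep the escape if parsing fails

        last = m[0].second;
    }

    // Copy whatever literal text follows the last escape.
    out.append(last, text.cend());
    return out;
}
```

Sequences that do not have four hex digits after \u never match the pattern, so they are passed through untouched.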

猫腻 2024-12-23 22:36:26

For closure, and for anyone who comes across this thread in future, here is my initial solution, before optimising the scope of the variables. I'm not a fan of it, but it works given the unpredictable mix of Unicode and/or ASCII in the stream, over which I have no control (client only). While Unicode is rarely present, it is better to handle it than to display ugly \u1234 sequences.

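QString restoreUnicode(QString strText)
{
    // A literal backslash, 'u', then exactly four hexadecimal digits.
    QRegExp rxUnicode("\\\\u([0-9a-fA-F]{4})");

    bool bSuccessFlag;
    int iSafetyOffset = 0;
    int iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);

    while (iNeedle != -1)
    {
        // Parse the four hex digits that follow "\u" as a UTF-16 code unit.
        QChar cCodePoint(strText.mid(iNeedle + 2, 4).toInt(&bSuccessFlag, 16));

        // Replace only this occurrence, in place (6 characters: \uXXXX).
        if ( bSuccessFlag )
            strText.replace(iNeedle, 6, QString(cCodePoint));

        // Always resume after the current position: this avoids a lock on
        // non code points and stops a decoded '\' from being decoded again.
        iSafetyOffset = iNeedle + 1;

        iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);
    }

    return strText;
}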
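
On the time-complexity concern raised in the question: the loop above searches the string again after every replacement, and each replacement shifts the remainder of the string. A single left-to-right pass avoids both. The following is a minimal sketch of that idea in standard C++ rather than Qt, so std::u16string stands in for QString. The names hexValue and restoreUnicodeSinglePass are hypothetical, and bytes outside the escapes are copied through one by one, which is only meaningful if they are ASCII.

```cpp
#include <cstddef>
#include <string>

// Returns the numeric value of a hex digit, or -1 if `c` is not one.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper (not from the thread): decodes \uXXXX escapes in one
// pass, so the work is linear in the length of the input.
std::u16string restoreUnicodeSinglePass(const std::string& text)
{
    std::u16string out;
    out.reserve(text.size()); // the output is never longer than the input

    std::size_t i = 0;
    while (i < text.size())
    {
        // A candidate escape needs six characters: '\', 'u' and four digits.
        if (text[i] == '\\' && i + 5 < text.size() && text[i + 1] == 'u')
        {
            int value = 0;
            bool ok = true;
            for (std::size_t k = 2; k < 6; ++k)
            {
                const int digit = hexValue(text[i + k]);
                if (digit < 0) { ok = false; break; }
                value = value * 16 + digit;
            }
            if (ok)
            {
                out.push_back(static_cast<char16_t>(value));
                i += 6; // skip the whole escape
                continue;
            }
        }
        // Not a well-formed escape: copy the byte through unchanged.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(text[i])));
        ++i;
    }
    return out;
}
```

Because decoded characters are appended to a separate output buffer, a decoded backslash is never re-examined, and malformed sequences such as \u12zz are left as they are.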

笑看君怀她人 2024-12-23 22:36:26

#include <assert.h>
#include <iostream>
#include <string>
#include <sstream>
#include <locale>
#include <codecvt>          // C++11
using namespace std;

int main()
{
    char const  data[]  = "\\u7cfb\\u8eca\\u4e21\\uff1a\\u6771\\u5317";

    istringstream   stream( data );

    wstring     ws;
    int         code;
    char        slashCh, uCh;
    while( stream >> slashCh >> uCh >> hex >> code )
    {
        assert( slashCh == '\\' && uCh == 'u' );
        ws += wchar_t( code );
    }

    cout << "Unicode code points:" << endl;
    for( auto it = ws.begin();  it != ws.end();  ++it )
    {
        cout << hex << 0 + *it << endl;
    }
    cout << endl;

    // The following is C++11 specific.
    cout << "UTF-8 encoding:" << endl;
    wstring_convert< codecvt_utf8< wchar_t > >  converter;
    string const bytes = converter.to_bytes( ws );
    for( auto it = bytes.begin();  it != bytes.end();  ++it )
    {
        cout << hex << 0 + (unsigned char)*it << ' ';
    }
    cout << endl;
}
