Sanitizing strings that are expected to be UTF-8, but sometimes are not, in C++

Posted on 2025-01-18 01:18:53

We are loading SQLite DBs into PostgreSQL.

SQLite expects UTF-8 strings, but is rather lenient and does not enforce UTF-8 validity.
PostgreSQL, on the other hand, is strict, and will fail the transaction when given such strings.
During tests, invalid strings do turn up once in a while, so we must do something.

I'd like to detect such strings, and replace the invalid sequences with U+FFFD, in C++.
I've looked a little into std::codecvt, but all examples are from one encoding to another,
not sanitizing within a single (UTF-8) encoding, and the cppreference.com doc isn't crystal clear to me either...

Do note that the plan is to record those anomalies, so that data managers can go back after the fact and manually review the strings. If they recognize the native (non-UTF-8) encoding of the original strings, they can then update the automatically sanitized strings (the ones that contain U+FFFD replacement characters). Thus, solutions that are pure SQLite or PostgreSQL are probably not possible (although I'd be happy to read about those too, if any).

I've found code that does it manually, but before I go too far into that rabbit hole,
is there std-C++17 code that could achieve the above? Portable solution needed (MSVC and GCC).

PS: I can detect a non-UTF-8 string, since std::wstring_convert<std::codecvt_utf8<wchar_t>>::from_bytes throws in that case, but that does not give me the sanitized string, nor does it tell me where the bad sequences are.

Comments (1)

懒的傷心 2025-01-25 01:18:53

I ended up using https://github.com/nemtrif/utfcpp, as suggested by Alan Birtles, since UTFCPP is well documented, and appears stable. That library is header-only, and I wrapped it as shown below.

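#include <cstddef>
#include <string>
#include <string_view>
#include "utf8.h" // utfcpp; the string_view overloads need C++17

// Returns true if text is valid UTF-8. If p_invalid_offset is non-null, it
// receives the offset of the first invalid byte (npos when there is none).
bool isUtf8(std::string_view text, size_t* p_invalid_offset) {
    size_t local_invalid = 0;
    size_t& invalid = p_invalid_offset ? *p_invalid_offset : local_invalid;
    invalid = ::utf8::find_invalid(text);
    return invalid == std::string_view::npos;
}

// Returns a copy of text with invalid sequences replaced by U+FFFD.
std::string sanitizeUtf8(std::string_view text) {
    return ::utf8::replace_invalid(text);
}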
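
For readers who cannot add a dependency, here is a rough stdlib-only sketch of the same two operations. It is not how UTFCPP is implemented, and the function names (validSequenceLength, firstInvalidUtf8, replaceInvalidUtf8) are made up for this example. It assumes a simple policy of one U+FFFD per offending byte, so the number of replacement characters may differ from what a library produces for the same input. The optional offsets vector is there because the question wants to record where the bad bytes were.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Length of the well-formed UTF-8 sequence starting at s[i], or 0 if the
// bytes there are not valid (stray continuation byte, overlong form,
// UTF-16 surrogate, value above U+10FFFF, or truncated input).
inline std::size_t validSequenceLength(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    auto isCont = [&](std::size_t k) {
        return k < s.size() && (byte(k) & 0xC0) == 0x80;
    };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;                     // ASCII
    if (b0 >= 0xC2 && b0 <= 0xDF)                // 2 bytes (C0/C1 are overlong)
        return isCont(i + 1) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {              // 3 bytes
        if (!isCont(i + 1) || !isCont(i + 2)) return 0;
        const unsigned char b1 = byte(i + 1);
        if (b0 == 0xE0 && b1 < 0xA0) return 0;   // overlong
        if (b0 == 0xED && b1 > 0x9F) return 0;   // surrogate U+D800..U+DFFF
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {              // 4 bytes
        if (!isCont(i + 1) || !isCont(i + 2) || !isCont(i + 3)) return 0;
        const unsigned char b1 = byte(i + 1);
        if (b0 == 0xF0 && b1 < 0x90) return 0;   // overlong
        if (b0 == 0xF4 && b1 > 0x8F) return 0;   // above U+10FFFF
        return 4;
    }
    return 0;                                    // 0x80..0xC1 or 0xF5..0xFF
}

// Offset of the first invalid byte, or std::string_view::npos if s is valid.
inline std::size_t firstInvalidUtf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = validSequenceLength(s, i);
        if (n == 0) return i;
        i += n;
    }
    return std::string_view::npos;
}

// Copies s, replacing every invalid byte with U+FFFD (EF BF BD in UTF-8).
// If offsets is non-null, the input offset of each replaced byte is appended.
inline std::string replaceInvalidUtf8(std::string_view s,
                                      std::vector<std::size_t>* offsets = nullptr) {
    std::string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = validSequenceLength(s, i);
        if (n == 0) {
            out += "\xEF\xBF\xBD";
            if (offsets) offsets->push_back(i);
            ++i;                                 // resynchronize one byte later
        } else {
            out.append(s.substr(i, n));
            i += n;
        }
    }
    return out;
}
```

A maintained library is still the safer choice; this sketch is mostly useful for seeing what the validity rules are and where the offsets for the anomaly log could come from.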

Also, since invalid UTF-8 text is rather rare in my case, rather than calling sanitizeUtf8() systematically, I do:

size_t invalid = 0;
if (!isUtf8(sv, &invalid)) {
    std::string sanitized{ sv.substr(0, invalid) };
    sanitized += sanitizeUtf8(sv.substr(invalid));
    // use sanitized
} else {
    // use sv directly
}

This speeds things up, since isUtf8() is faster than sanitizeUtf8() and, again, non-UTF-8 input is very rare in my case.
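
Since the question also wants the anomalies recorded for later review, the fast path could be packaged into a helper that hands back the offset together with the sanitized text. The sketch below is only one possible shape: Utf8Anomaly and sanitizeIfNeeded are hypothetical names, and the find and replace operations are passed in as callables so the helper does not depend on any particular library.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// What an importer might keep for later manual review (hypothetical shape).
struct Utf8Anomaly {
    std::size_t offset;     // offset of the first invalid byte in the input
    std::string sanitized;  // the text that was actually loaded
};

// Returns std::nullopt when text is already valid (use text as-is),
// otherwise the sanitized copy plus the offset of the first problem.
//   findInvalid(text)    -> offset, or std::string_view::npos if valid
//   replaceInvalid(text) -> std::string with invalid sequences replaced
template <class FindInvalid, class ReplaceInvalid>
std::optional<Utf8Anomaly> sanitizeIfNeeded(std::string_view text,
                                            FindInvalid findInvalid,
                                            ReplaceInvalid replaceInvalid) {
    const std::size_t invalid = findInvalid(text);
    if (invalid == std::string_view::npos) return std::nullopt;
    // The prefix before the first invalid byte is known to be valid,
    // so only the tail goes through the slower replacement pass.
    std::string sanitized{text.substr(0, invalid)};
    sanitized += replaceInvalid(text.substr(invalid));
    return Utf8Anomaly{invalid, std::move(sanitized)};
}
```

With the wrappers from this answer, the call could look like sanitizeIfNeeded(sv, [](std::string_view s) { return ::utf8::find_invalid(s); }, sanitizeUtf8), with an empty result meaning sv can be loaded directly.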

Performance-wise, I get around 270 MB/s for isUtf8() and 140 MB/s for sanitizeUtf8() on Win10 with MSVC 2019, and 1,100 MB/s and 250 MB/s respectively on RedHat 7 with GCC 9 (an older machine, though).

PS: For completeness, the PostgreSQL error on non-UTF-8 input is character_not_in_repertoire (SQLSTATE 22021).
