Using the boost::format %s specifier with UTF-8 strings

Posted on 2024-11-06 12:11:46

We are adding UTF-8 support to an existing application with a large code base. This application uses boost::format(), and output containing non-ASCII characters is not aligned properly. Specifically, when using the %{width}.{length}s specifier, boost::format() counts chars (bytes), which does not "do the right thing" with UTF-8 strings, where one character can occupy several bytes. I think it should be possible to change the string length code (which is probably string::size()) to use utf8len() or something analogous, based on ... something?

In this case, it is not practical to change the existing code base to use UCS-2 (or UCS-4, UTF-16, etc.), but it is possible to modify boost::format() if necessary. I was hoping someone else had run into this need and could point me to a possible solution.

Note: I found some web pages on using locales with UTF-8, but most of that seemed more applicable to converting between UTF-8 and UCS-4 in streams.
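
For illustration, here is a minimal sketch of the utf8len() idea mentioned above, done outside of boost::format() rather than by patching it. The helper names utf8_length, utf8_truncate and utf8_pad are hypothetical (they are not part of Boost), and the code assumes valid UTF-8 input. It counts code points, not display columns, so wide CJK characters and combining marks would still need extra handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Hypothetical helper: keeps at most max_chars code points, cutting only
// at a code point boundary so no multi-byte sequence is split.
std::string utf8_truncate(const std::string& s, std::size_t max_chars) {
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            if (count == max_chars) break;
            ++count;
        }
    }
    return s.substr(0, i);
}

// Hypothetical helper: mimics %{width}.{precision}s (right-aligned), but
// measures both width and precision in code points instead of bytes.
std::string utf8_pad(const std::string& s, std::size_t width,
                     std::size_t precision) {
    std::string out = utf8_truncate(s, precision);
    std::size_t len = utf8_length(out);
    if (len < width) out.insert(0, width - len, ' ');
    return out;
}
```

With these helpers, utf8_pad("héllo", 8, 3) yields five spaces followed by "hél", and the result can then be passed to an unpadded %s so the formatter does not need to measure the string itself.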

Comments (2)

凶凌 2024-11-13 12:11:46

This is probably too late for you, but maybe it will help someone else. boost::format accepts a std::locale as an optional constructor argument (see http://www.boost.org/doc/libs/1_55_0/libs/format/doc/format.html). If you pass it a Unicode-aware locale, such as the one boost::locale::generator produces for "en_US.UTF-8", you should get the desired behavior.

Instead of passing a locale to the boost::format constructor each time, you could also set the default locale of your application, which might help you avoid other problems. If you take this route, I would recommend a boost::locale locale over a plain std::locale, as boost::locale's locales won't modify your numeric formatting unless you explicitly ask them to (docs here).

In general, this is a go-to approach for making a C++ application work nicely with Unicode. If the functionality can use a locale (std::regex, std::sort, boost::format), give it a Unicode-aware locale, and you should be safe (and if you aren't, please tell me, I want to know).

If you are making a small, lightweight application and only care about the 80% case, you may not want to pay the price of including ICU (International Components for Unicode), which is the default engine boost locale wraps when providing Unicode support. In that case, build Boost using your OS's or POSIX Unicode support, and your application will remain small and light, but you will miss out on much of the Unicode support, such as multiple collation levels.

For the problem you are describing, POSIX support is likely sufficient.

云仙小弟 2024-11-13 12:11:46

AFAIK Boost Format measures everything in code units even when a UTF-8 based locale is used.
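
The code-unit behavior is easy to reproduce with standard streams, which Boost Format builds on. The sketch below does not call Boost at all; stream_pad is a hypothetical helper that uses std::setw, so it only illustrates the same kind of byte-based padding and is not a test of boost::format itself.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: right-aligns s in a field of the given width using
// plain iostream formatting, which pads based on std::string::size(),
// that is, on the number of chars (code units), not on characters.
std::string stream_pad(const std::string& s, int width) {
    std::ostringstream out;
    out << std::setw(width) << s;
    return out.str();
}
```

For the two-byte UTF-8 string "é", stream_pad("é", 5) produces only three spaces of padding, so the field is five bytes long but appears four columns wide on a terminal.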

If you can switch to another library, consider C++20 std::format or the {fmt} formatting library, both of which measure width in display-width units (similar to wcswidth), so the alignment is correct. For example:

fmt::print("┌{0:─^{2}}┐\n"
           "│{1: ^{2}}│\n"
           "└{0:─^{2}}┘\n", "", "Hello, world!", 20);

prints:

┌────────────────────┐
│   Hello, world!    │
└────────────────────┘

Disclaimer: I'm the author of {fmt} and C++20 std::format
