C++ UTF-8 lightweight & permissive code?

Posted on 2024-09-04 21:52:14

Anyone know of a more permissive license (MIT / public domain) version of this:

http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html

('drop-in' replacement for std::string that's UTF-8 aware)

Lightweight, does everything I need and even more (I doubt I'll even use the UTF-XX conversions)

I really don't want to be carrying ICU around with me.

Comments (3)

送舟行 2024-09-11 21:52:14
  1. std::string is fine for UTF-8 storage.
  2. If you need to analyze the text itself, UTF-8 awareness will not help you much, because too many things in Unicode do not work on a per-code-point basis.
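
To illustrate point 1, here is a rough sketch (hand-written for this answer, not taken from any library) of appending a code point to a std::string as UTF-8. The name `append_utf8` is made up for the example. The string simply holds the encoded bytes, so `size()` reports bytes, not code points.

```cpp
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for surrogates or values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate range
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    if (cp < 0x80) {                                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Copying, concatenating and comparing such strings for equality works on the raw bytes and needs no Unicode knowledge.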

Take a look at the Boost.Locale library (it uses ICU under the hood).

It is not lightweight, but it lets you handle Unicode correctly, and it uses std::string as storage.

If you expect to find a lightweight Unicode-aware library for dealing with strings, you will not find one, because Unicode is not lightweight. Even relatively "simple" operations such as upper-case and lower-case conversion or Unicode normalization require complex algorithms and access to the Unicode database.

If you need the ability to iterate over code points (which, BTW, are not characters),
take a look at http://utfcpp.sourceforge.net/
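
For a feel of what iterating over code points involves, here is a minimal hand-rolled sketch. It is not utfcpp's API, and the function name `code_points` is invented for the example. It assumes mostly well-formed input and does not reject overlong encodings or surrogates, which a tested library would do for you.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Decoding stops at the first invalid lead byte, bad continuation byte,
// or truncated sequence; whatever was decoded so far is returned.
std::vector<char32_t> code_points(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return out;                    // stray continuation or invalid lead byte
        if (i + len > s.size()) return out; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return out;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);         // shift in 6 payload bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```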

Answer to comment:

1) Find file formats for files included by me

std::string::find is perfectly fine for this.
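
As a sketch of why plain byte searches are enough here: in UTF-8, ASCII bytes such as '.' or '/' never occur inside a multi-byte sequence, so `rfind` and `find_last_of` cannot hit the middle of a character. The example assumes "file format" means the file extension, and `file_extension` is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the text after the last '.' in the final
// path component of a UTF-8 path, or "" if that component has no '.'.
std::string file_extension(const std::string& utf8_path) {
    std::size_t slash = utf8_path.find_last_of("/\\");
    std::size_t start = (slash == std::string::npos) ? 0 : slash + 1;
    std::size_t dot = utf8_path.rfind('.');
    if (dot == std::string::npos || dot < start) return "";
    return utf8_path.substr(dot + 1);
}
```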

2) Line break detection

This is not a simple issue. Have you ever tried to find a line break in Chinese/Japanese text? Probably not, as spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only pango has something like that.)

And of course Boost.Locale does this, and does it correctly.

And if you need to do this for European languages only, just search for spaces or punctuation marks, so std::string::find is more than fine.
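
A rough sketch of that naive approach, assuming breaks are only allowed after runs of ASCII spaces or tabs. This is nowhere near the Unicode line breaking algorithm (UAX #14), and `break_opportunities` is an invented name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offsets at which a new line may
// start, i.e. the first byte after each run of ASCII spaces or tabs.
std::vector<std::size_t> break_opportunities(const std::string& text) {
    std::vector<std::size_t> out;
    std::size_t pos = text.find_first_of(" \t");
    while (pos != std::string::npos) {
        std::size_t next = text.find_first_not_of(" \t", pos);
        if (next == std::string::npos) break;  // only trailing whitespace left
        out.push_back(next);                   // a line may start here
        pos = text.find_first_of(" \t", next);
    }
    return out;
}
```

The offsets are byte positions, which is what `substr` expects; they always fall on code point boundaries because a space or tab byte never appears inside a multi-byte UTF-8 sequence.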

3) Character (or now, code point) counting. Looking at utfcpp, thx

Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters and 6 Unicode code points, where two of the code points are used for vowels. The same applies to European languages, where a single character can be represented with two code points; for example, "ü" can be represented as "u" followed by the combining diaeresis "¨" (U+0308), which is two code points.

So if you are aware of these issues then utfcpp will be fine, otherwise you will not find anything simpler.
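
To make the difference concrete, here is a small sketch that counts code points by counting every byte that is not a continuation byte. It assumes well-formed UTF-8, and `count_code_points` is just a name picked for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

With this, the precomposed "ü" (bytes C3 BC) counts as 1 code point, while "u" followed by U+0308 (bytes 75 CC 88) counts as 2, even though both display as one character. Neither count tells you how many characters the user sees.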

不再见 2024-09-11 21:52:14

I have never used it, but I stumbled upon this UTF-8 CPP library a while ago and had a good enough impression to bookmark it. It is released under a BSD-like license, IIUC.

It still relies on std::string for strings and provides lots of utility functions to help check that a string really is UTF-8, to count the number of characters, to go back or forward by one character … It is really small and lives only in header files: looks really good!
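
To give an idea of what "go back or forward by one character" means at the byte level, here is a hand-written sketch. It is not the library's API, the function names are invented, and "character" here really means code point. It assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: byte offset of the code point that follows the one
// starting at `pos`, or s.size() if there is none.
std::size_t next_boundary(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return s.size();
    ++pos;
    while (pos < s.size() && is_continuation(s[pos])) ++pos;
    return pos;
}

// Hypothetical helper: byte offset of the code point that precedes `pos`,
// or 0 if `pos` is already at the start.
std::size_t prev_boundary(const std::string& s, std::size_t pos) {
    if (pos > s.size()) pos = s.size();
    if (pos == 0) return 0;
    --pos;
    while (pos > 0 && is_continuation(s[pos])) --pos;
    return pos;
}
```

Stepping backwards works without any extra bookkeeping because continuation bytes are distinguishable from lead bytes.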

秋意浓 2024-09-11 21:52:14

You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, but it is by no means a drop-in replacement for std::string.
