UTF-8 support in C++ on Unix / Mac OS X
I need a way to read from a UTF-8 encoded file and store its data in "UTF-8 compatible strings" of some sort, in C++. This data needs to be written back to a UTF-8 encoded file later on. There seems to be a lot of advice on Google about doing this on Windows, but I cannot find any help for Unix systems.
Thanks for your help!
Comments (1)
If all you need to do is read and write it then std::string is fine.
This works because no byte of a multi-byte UTF-8 sequence overlaps with an ASCII character (every such byte has its high bit set), so the standard processing of text works just fine with respect to end-of-line sequences, and the stream does no other processing. What you read is what you get. Outputting the string does not change any of the code points.
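A minimal sketch of that read/write round trip, in case it helps. The function names and the file name `copy.txt` are made up for illustration, and opening in binary mode is my own choice to make the "what you read is what you get" part explicit (on Unix it makes no difference).

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Read the whole file into a std::string, byte for byte.
// The string simply holds the UTF-8 bytes; nothing is decoded.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Write the bytes back out unchanged.
bool write_file(const std::string& path, const std::string& data) {
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    out.flush();
    return out.good();
}

// Round-trip check; "copy.txt" is a hypothetical file name.
void round_trip_check() {
    const std::string text = "na\xC3\xAFve \xE2\x82\xAC\n";  // "naïve €" as UTF-8 bytes
    assert(write_file("copy.txt", text));
    assert(read_file("copy.txt") == text);
}
```

Note that `text.size()` here counts bytes, not characters, which is fine as long as you only pass the data through.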
Now, if you need to manipulate the text, that is a different question and it gets more complex.
Usually manipulating UTF-8 directly is way too hard (it can be done, but it is not worth it IMO).
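To give a feel for why, here is a small sketch showing that even "how many characters are in this string" needs care once the text is not pure ASCII. `count_code_points` is a hypothetical helper, and it assumes the input is already well-formed UTF-8 (it does no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes well-formed UTF-8; malformed input is not detected.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;  // lead byte or ASCII byte
    }
    return n;
}

// size() reports bytes, which differs from the number of code points.
void size_vs_code_points() {
    const std::string word = "na\xC3\xAFve";  // "naïve" as UTF-8 bytes
    assert(word.size() == 6);                 // 6 bytes
    assert(count_code_points(word) == 5);     // 5 code points
}
```

Indexing, truncating, or reversing the string by byte position has the same problem: you can land in the middle of a multi-byte sequence.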
When it comes to manipulating the text, you want to convert the UTF-8 (which is not fixed width) to an internal fixed-width format. UTF-16 and UTF-32 are the common formats for manipulation and are easy to use (UTF-16 on Windows, UTF-32 on most *nix-like systems), though strictly only UTF-32 is fixed width: UTF-16 needs a surrogate pair for code points outside the Basic Multilingual Plane. The easiest way to do this is to imbue the stream with a facet that knows the input is UTF-8 and converts it automatically.
There are a couple of these facets floating around in different libraries, but an easy one to find is in Boost:
http://www.boost.org/doc/libs/1_38_0/libs/serialization/doc/codecvt.html
Note: It is also in Boost 1.46, the latest version at the time of writing.
The process is the same for writing UTF-16/32 back to a stream and converting it to UTF-8.
Note: You should imbue the stream before the file is opened. Different implementations of the stream will react differently if you imbue it after it is open, so it is best to imbue the stream before opening it.
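A rough sketch of the imbue-then-open pattern. To keep it self-contained I have swapped in the standard library's `std::codecvt_utf8` (C++11, deprecated since C++17) in place of the Boost facet; the overall shape should be the same with a different facet, but treat that as an assumption. The function names and `facet_copy.txt` are made up, and the sketch assumes a platform where `wchar_t` is 32 bits (typical on Linux and Mac OS X).

```cpp
#include <cassert>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Read a UTF-8 file into a wide string (UTF-32 where wchar_t is 32 bits).
std::wstring read_utf8(const std::string& path) {
    std::wifstream in;
    // Imbue first, open second. The locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);
    std::wstring text;
    wchar_t ch;
    while (in.get(ch)) text.push_back(ch);
    return text;
}

// Write a wide string back out; the facet converts it to UTF-8.
bool write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);
    out << text;
    out.flush();
    return out.good();
}

// Round-trip check; "facet_copy.txt" is a hypothetical file name.
void facet_round_trip_check() {
    const std::wstring text = L"na\u00EFve \u20AC";  // 7 code points
    assert(write_utf8("facet_copy.txt", text));
    assert(read_utf8("facet_copy.txt") == text);
}
```

Once the text is in a `std::wstring` like this, each element is one code point (on a 32-bit `wchar_t` platform), so indexing and length behave the way you would expect.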
Dinkumware also has a set of conversion facets (not sure if they are free).
http://www.dinkumware.com/manuals/default.aspx?manual=compleat&page=index_cvt.html#Code%20Conversions
Note: I prefer to use the terms UTF-X rather than UCS-Y. Though technically there are very minor differences, these are inconsequential compared to the confusion you can create by switching between the two terms while talking about the subject. Stick to one unless you need to talk explicitly about a feature (like surrogate pairs).