Length of a C++ std::string in bytes
I'm having some trouble figuring out the exact semantics of std::string::length().

The documentation explicitly points out that length() returns the number of characters in the string and not the number of bytes. I was wondering in which cases this actually makes a difference.

In particular, is this only relevant to non-char instantiations of std::basic_string<>, or can I also get into trouble when storing UTF-8 strings with multi-byte characters? Does the standard allow for length() to be UTF-8-aware?
Comments (4)
When dealing with non-char instantiations of std::basic_string<>, sure, the length may not equal the number of bytes. This is particularly evident with std::wstring, where each wchar_t element is typically wider than one byte.

But std::string is about char characters; there is no such thing as a multi-byte character as far as std::string is concerned, whether you crammed one in at a higher level or not. So std::string::length() is always the number of bytes represented by the string. Note that if you're cramming multi-byte "characters" into a std::string, then your definition of "character" is suddenly at odds with that of the container and of the standard.
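To make the difference concrete, here is a minimal sketch. The function names are made up for illustration, and the exact value of sizeof(wchar_t) is implementation-defined (commonly 2 or 4), so the wide byte count is only expressed relative to it.

```cpp
#include <cstddef>
#include <string>

// Illustrative helpers (hypothetical names, not part of any library).

// The two-letter text "hi" stored wide: length() counts wchar_t elements.
std::size_t wide_elements() {
    return std::wstring(L"hi").length();
}

// Bytes used by those elements; depends on the platform's sizeof(wchar_t).
std::size_t wide_bytes() {
    return std::wstring(L"hi").length() * sizeof(wchar_t);
}

// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9. std::string sees two
// char elements, so length() reports the byte count, not one "character".
std::size_t utf8_e_acute_length() {
    return std::string("\xC3\xA9").length();
}
```

Calling wide_elements() gives 2 while wide_bytes() gives 2 * sizeof(wchar_t), and utf8_e_acute_length() gives 2 even though a reader would see a single letter.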

If we are talking specifically about std::string, then length() does return the number of bytes.

This is because a std::string is a basic_string of chars, and the C++ Standard defines the size of one char to be exactly one byte.

Note that the Standard doesn't say how many bits are in a byte, but that's another story entirely and you probably don't care.

EDIT: The Standard does say that an implementation shall provide a definition for CHAR_BIT, which says how many bits are in a byte.

By the way, if you go down a road where you do care how many bits are in a byte, you might consider reading this.
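Both guarantees can be checked at compile time. The sketch below assumes nothing beyond <climits>; bit_length is a hypothetical helper added only to show where CHAR_BIT would come into play.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// sizeof(char) is 1 by definition, so length() of a std::string is a byte count.
static_assert(sizeof(char) == 1, "sizeof(char) is always 1");

// CHAR_BIT comes from <climits>; the standard requires it to be at least 8.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by the characters of s.
std::size_t bit_length(const std::string& s) {
    return s.length() * CHAR_BIT;
}
```

On the usual platforms where CHAR_BIT is 8, bit_length("abc") would be 24, but the code does not rely on that.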

A std::string is std::basic_string<char>, so s.length() * sizeof(char) = byte length. Also, std::string knows nothing of UTF-8, so you're going to get the byte size even if that's not really what you're after.

If you have UTF-8 data in a std::string, you'll need to use something else, such as ICU, to get the "real" length.
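If all you need is the number of Unicode code points, and you can assume the data is well-formed UTF-8, a small loop is enough. This is a rough sketch, not a replacement for ICU: utf8_code_points is a hypothetical helper, it does no validation, and code points are not the same thing as user-perceived characters (combining marks and similar cases need a proper Unicode library).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in s, assuming well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "naïve" (written as "na\xC3\xAFve"), length() reports 6 while utf8_code_points reports 5.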

cplusplus.com is not "the documentation" for std::string; it's a poor-quality site full of poor-quality information. The C++ standard defines it very clearly:

21.1 [strings.general] ¶1

"This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types, and objects of char-like types are called char-like objects or simply characters."

21.4.4 [string.capacity] ¶1

size_type size() const noexcept;
Returns: A count of the number of char-like objects currently in the string.
Complexity: constant time.

size_type length() const noexcept;
Returns: size().
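Read that way, size() and length() count char-like objects for every instantiation, and the byte count follows from the element type. A generic sketch, where byte_length is a hypothetical name and C++11 or later is assumed for std::u32string:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes of character data held by any basic_string,
// computed as the number of char-like objects times the size of one of them.
template <typename String>
std::size_t byte_length(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}
```

For a std::string the result equals size(), since sizeof(char) is 1; for a std::u32string holding U"hi" it is 2 * sizeof(char32_t).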