How to convert an instance of std::string to lower case
I want to convert a std::string to lowercase. I am aware of the function tolower(). However, in the past I have had issues with this function and it is hardly ideal anyway, as using it with a std::string would require iterating over each character.
Is there an alternative which works 100% of the time?
Adapted from Not So Frequently Asked Questions:
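A minimal sketch of the usual per-character approach, assuming std::transform together with std::tolower; the function name to_lower_copy is hypothetical. The character is passed as unsigned char because std::tolower has undefined behavior for negative values other than EOF.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lowercased copy of `data`, converting one byte at a time.
// Only meaningful for single-byte encodings; in the default "C" locale
// only ASCII A-Z are changed.
std::string to_lower_copy(std::string data) {
    std::transform(data.begin(), data.end(), data.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return data;
}
```

Calling to_lower_copy("Hello World") would be expected to give "hello world".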
You're really not going to get away without iterating through each character. There's no way to know whether the character is lowercase or uppercase otherwise.
If you really hate tolower(), here's a specialized ASCII-only alternative that I don't recommend you use:
Be aware that tolower() can only do a per-single-byte-character substitution, which is ill-fitting for many scripts, especially if using a multi-byte encoding like UTF-8.
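The ASCII-only alternative mentioned above could look roughly like this. It is a sketch that assumes an ASCII-compatible execution character set; the names ascii_to_lower and ascii_lower_copy are made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Maps 'A'..'Z' to 'a'..'z' and leaves every other byte untouched.
// Assumes ASCII, where the two ranges are contiguous and 32 apart.
char ascii_to_lower(char in) {
    if (in >= 'A' && in <= 'Z')
        return static_cast<char>(in + ('a' - 'A'));
    return in;
}

// Applies ascii_to_lower to every byte of a copy of `data`.
std::string ascii_lower_copy(std::string data) {
    std::transform(data.begin(), data.end(), data.begin(), ascii_to_lower);
    return data;
}
```

Unlike tolower(), this ignores the locale entirely, which is exactly why it is only suitable when the input is known to be plain ASCII.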
Boost provides a string algorithm for this:
Or, for non-in-place:
tl;dr
Use the ICU library. If you don't, your conversion routine will break silently on cases you are probably not even aware exist.
First you have to answer a question: What is the encoding of your std::string? Is it ISO-8859-1? Or perhaps ISO-8859-8? Or Windows Codepage 1252? Does whatever you're using to convert upper-to-lowercase know that? (Or does it fail miserably for characters over 0x7f?)
If you are using UTF-8 (the only sane choice among the 8-bit encodings) with std::string as container, you are already deceiving yourself if you believe you are still in control of things. You are storing a multibyte character sequence in a container that is not aware of the multibyte concept, and neither are most of the operations you can perform on it! Even something as simple as .substr() could result in invalid (sub-)strings because you split in the middle of a multibyte sequence.
As soon as you try something like std::toupper( 'ß' ), or std::tolower( 'Σ' ) in any encoding, you are in trouble. Because 1), the standard only ever operates on one character at a time, so it simply cannot turn ß into SS as would be correct. And 2), the standard only ever operates on one character at a time, so it cannot decide whether Σ is in the middle of a word (where σ would be correct), or at the end (ς). Another example would be std::tolower( 'I' ), which should yield different results depending on the locale -- virtually everywhere you would expect i, but in Turkey ı (LATIN SMALL LETTER DOTLESS I) is the correct answer (which, again, is more than one byte in UTF-8 encoding).
So, any case conversion that works on a character at a time, or worse, a byte at a time, is broken by design. This includes all the std:: variants in existence at this time.
Then there is the point that the standard library, for what it is capable of doing, depends on which locales are supported on the machine your software is running on... and what do you do if your target locale is among those not supported on your client's machine?
So what you are really looking for is a string class that is capable of dealing with all this correctly, and that is not any of the std::basic_string<> variants.
(C++11 note: std::u16string and std::u32string are better, but still not perfect. C++20 brought std::u8string, but all these do is specify the encoding. In many other respects they still remain ignorant of Unicode mechanics, like normalization, collation, ...)
While Boost looks nice, API wise, Boost.Locale is basically a wrapper around ICU. If Boost is compiled with ICU support... if it isn't, Boost.Locale is limited to the locale support compiled for the standard library.
And believe me, getting Boost to compile with ICU can be a real pain sometimes. (There are no pre-compiled binaries for Windows that include ICU, so you'd have to supply them together with your application, and that opens a whole new can of worms...)
So personally I would recommend getting full Unicode support straight from the horse's mouth and using the ICU library directly:
Compile (with G++ in this example):
This gives:
Note the Σ<->σ conversion in the middle of the word, and the Σ<->ς conversion at the end of the word. No <algorithm>-based solution can give you that.
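To make the point about byte-wise operations on UTF-8 concrete, here is a small standard-library-only sketch. The helper names bytewise_lower and splits_sequence are hypothetical, and the code assumes the program runs in the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Byte-wise lowercasing: in the "C" locale only ASCII A-Z change, so the
// bytes of a multibyte UTF-8 character such as Σ (0xCE 0xA3) pass through
// unmodified and the letter stays upper case.
std::string bytewise_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// True if position `i` holds a UTF-8 continuation byte (bit pattern 10xxxxxx),
// i.e. cutting the string there with substr() would split a character.
bool splits_sequence(const std::string& s, std::size_t i) {
    return i < s.size() && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80;
}
```

For the UTF-8 string "A" followed by Σ, bytewise_lower changes only the "A", and splits_sequence reports that index 2 falls inside the two-byte sequence for Σ.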
Using the range-based for loop of C++11, simpler code would be:
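A minimal sketch of that idea, assuming the locale-aware overload of std::tolower from <locale>; the function name lower_with_locale and the default of the classic "C" locale are choices made for this illustration.

```cpp
#include <locale>
#include <string>

// Builds a lowercased copy by visiting each character with a range-based
// for loop and converting it according to the given locale.
std::string lower_with_locale(const std::string& str,
                              const std::locale& loc = std::locale::classic()) {
    std::string out;
    out.reserve(str.size());
    for (auto elem : str)
        out += std::tolower(elem, loc);  // std::tolower(charT, const std::locale&)
    return out;
}
```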
Another approach, using a range-based for loop with a reference variable:
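As a sketch (the function name lower_in_place is hypothetical), binding each character by reference lets the loop modify the string directly:

```cpp
#include <cctype>
#include <string>

// Lowercases `test` in place; `c` is a reference to each character in turn.
void lower_in_place(std::string& test) {
    for (auto& c : test)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
```

The cast to unsigned char guards against negative char values, which std::tolower does not accept.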
If the string contains UTF-8 characters outside of the ASCII range, then boost::algorithm::to_lower will not convert those. Better use boost::locale::to_lower when UTF-8 is involved. See http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html
This is a follow-up to Stefan Mai's response: if you'd like to place the result of the conversion in another string, you need to pre-allocate its storage space prior to calling std::transform. Since STL stores transformed characters at the destination iterator (incrementing it at each iteration of the loop), the destination string will not be automatically resized, and you risk memory stomping.
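A sketch of the pre-allocation described above, with a hypothetical function name lowered_copy. The destination is resized before std::transform writes through dst.begin().

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Writes the lowercased characters of `src` into a separate string.
std::string lowered_copy(const std::string& src) {
    std::string dst;
    dst.resize(src.size());  // allocate the destination first; transform will not grow it
    std::transform(src.begin(), src.end(), dst.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return dst;
}
```

An alternative that avoids the explicit resize is to pass std::back_inserter(dst) from <iterator> as the destination, which appends each converted character instead.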
The simplest way to convert a string to lowercase without bothering about the std namespace is as follows:
1: string with/without spaces
2: string without spaces
I wrote this simple helper function:
Usage:
My own template functions, which perform upper / lower case conversion.
std::ctype::tolower() from the standard C++ Localization library will correctly do this for you. Here is an example extracted from the tolower reference page:
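The following is not the reference page's example but a minimal sketch of the std::ctype approach, assuming the range overload of std::ctype<char>::tolower. The function name ctype_lower is hypothetical, and the classic "C" locale is used as the default so the snippet does not depend on locales installed on the machine.

```cpp
#include <locale>
#include <string>

// Lowercases the whole buffer in one call through the locale's ctype facet.
std::string ctype_lower(std::string s,
                        const std::locale& loc = std::locale::classic()) {
    if (s.empty())
        return s;
    const auto& facet = std::use_facet<std::ctype<char>>(loc);
    facet.tolower(&s[0], &s[0] + s.size());  // converts [begin, end) in place
    return s;
}
```

Passing a different std::locale changes which single-byte characters are treated as upper case, but the conversion is still one char at a time.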
An alternative to Boost is POCO (pocoproject.org).
POCO provides two variants:
"In Place" versions always have "InPlace" in the name.
Both versions are demonstrated below:
Since none of the answers mentioned the Ranges library, which has been available in the standard library since C++20 and is separately available on GitHub as range-v3, I would like to add a way to perform this conversion using it.
To modify the string in-place:
To generate a new string:
(Don't forget to #include <cctype> and the required Ranges headers.)
Note: the use of unsigned char as the argument to the lambda is inspired by cppreference, which states:
On Microsoft platforms you can use the strlwr family of functions: http://msdn.microsoft.com/en-us/library/hkxwh33z.aspx
There is a way to convert upper case to lower WITHOUT doing if tests, and it's pretty straightforward. The isupper() function/macro's use of clocale.h should take care of problems relating to your locale, but if not, you can always tweak the UtoL[] to your heart's content.
Given that C's characters are really just 8-bit ints (ignoring the wide character sets for the moment) you can create a 256 byte array holding an alternative set of characters, and in the conversion function use the chars in your string as subscripts into the conversion array.
Instead of a 1-for-1 mapping though, give the upper-case array members the BYTE int values for the lower-case characters. You may find islower() and isupper() useful here.
The code looks like this...
This approach will, at the same time, allow you to remap any other characters you wish to change.
This approach has one huge advantage when running on modern processors: there is no need to do branch prediction, as there are no if tests comprising branching. This saves the CPU's branch prediction logic for other loops, and tends to prevent pipeline stalls.
Some here may recognize this approach as the same one used to convert EBCDIC to ASCII.
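A sketch of the lookup-table idea described above. The table name UtoL comes from the text; the helper names make_utol and table_lower are assumptions, and the table is filled using isupper()/tolower() in whatever locale is active when it is first built.

```cpp
#include <array>
#include <cctype>
#include <climits>
#include <cstddef>
#include <string>

// Builds the 256-entry table once: every byte maps to itself, except that
// upper-case entries hold the corresponding lower-case byte.
std::array<unsigned char, UCHAR_MAX + 1> make_utol() {
    std::array<unsigned char, UCHAR_MAX + 1> table{};
    for (int i = 0; i <= UCHAR_MAX; ++i) {
        table[static_cast<std::size_t>(i)] =
            static_cast<unsigned char>(std::isupper(i) ? std::tolower(i) : i);
    }
    return table;
}

// Converts in place with one indexed load per character; the per-character
// loop body contains no if test.
void table_lower(std::string& s) {
    static const auto UtoL = make_utol();
    for (char& c : s)
        c = static_cast<char>(UtoL[static_cast<unsigned char>(c)]);
}
```

Any other remapping can be added by overwriting individual entries of the table after it is built.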
No
There are several questions you need to ask yourself before choosing a lowercasing method.
Once you have answers to those questions you can start looking for a solution that fits your needs. There is no one-size-fits-all solution that works for everyone everywhere!
C++ doesn't have tolower or toupper methods implemented for std::string, but they are available for char. One can easily read each char of the string, convert it to the required case and put it back into the string.
Sample code without using any third-party library:
For character-based operations on a string: for every character in the string
Here's a macro technique if you want something simple:
However, note that @AndreasSpindler's comment on this answer is still an important consideration if you're working on something that isn't just ASCII characters.
For more information: http://www.cplusplus.com/reference/locale/tolower/
Try this function :)
Have a look at the excellent c++17 cpp-unicodelib (GitHub). It's single-file and header-only.
Output
An explanation of how this solution works:
Explanation:
for(auto& c : test) is a range-based for loop of the kind for (range_declaration : range_expression) loop_statement:
range_declaration: auto& c
Here the auto specifier is used for automatic type deduction, so the type is deduced from the variable's initializer.
range_expression: test
The range in this case is the characters of the string test.
The characters of the string test are available as a reference inside the for loop through the identifier c.
Use fplus::to_lower_case() from the fplus library.
Search to_lower_case in fplus API Search.
Example:
Google's absl library has absl::AsciiStrToLower / absl::AsciiStrToUpper
Since you are using std::string, you are using C++. If using C++11 or higher, this doesn't need anything fancy. If words is vector<string>, then:
Doesn't have strange exceptions. You might want to use wchar_t, but otherwise this should do it all in place.
For a different perspective, there is a very common use case which is to perform locale-neutral case folding on Unicode strings. For this case, it is possible to get good case folding performance when you realize that the set of foldable characters is finite and relatively small (< 2000 Unicode code points). This works very well with a generated perfect hash (guaranteed zero collisions), which can be used to convert every input character to its lowercase equivalent.
With UTF-8, you do have to be mindful of multi-byte characters and iterate accordingly. However, UTF-8 has fairly simple encoding rules that make this operation efficient.
For more details, including links to the relevant parts of the Unicode standard and a perfect hash generator, see my answer here, to the question How to achieve unicode-agnostic case insensitive comparison in C++.
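To illustrate the shape of that approach, here is a rough sketch and not the linked answer's code. It assumes well-formed UTF-8 input, and it uses a std::unordered_map with three hand-picked entries (the hypothetical kFold) as a stand-in for the generated perfect hash over the full Unicode case folding data. The names decode, encode and fold_utf8 are also made up.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical, deliberately tiny fold table standing in for a perfect hash
// generated from the complete Unicode case folding data.
static const std::unordered_map<char32_t, char32_t> kFold = {
    {U'\u00C9', U'\u00E9'},  // É -> é
    {U'\u03A3', U'\u03C3'},  // Σ -> σ
    {U'\u0416', U'\u0436'},  // Ж -> ж
};

// Decodes the code point starting at s[i] and advances i past it.
// Assumes the input is well-formed UTF-8.
char32_t decode(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80)
        return lead;
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Appends the UTF-8 encoding of `cp` to `out`.
void encode(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Folds every code point: ASCII letters via arithmetic, everything else via
// a table lookup; code points missing from the table pass through unchanged.
std::string fold_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        char32_t cp = decode(in, i);
        if (cp >= U'A' && cp <= U'Z') {
            cp += 32;
        } else {
            auto it = kFold.find(cp);
            if (it != kFold.end())
                cp = it->second;
        }
        encode(cp, out);
    }
    return out;
}
```

Because the mapping works per code point, it still cannot handle context-dependent or multi-character cases such as final sigma or ß; it is suited to case-insensitive comparison rather than display text.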
Code Snippet
Here are some optional libraries for ASCII string to_lower, both of which are production-level and micro-optimized, and which are expected to be faster than the existing answers here (TODO: add benchmark results).
Facebook's Folly:
Google's Abseil:
I wrote a templated version that works with any string :
Tested with gcc compiler:
output:
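A minimal sketch of what a templated version might look like; this is not the code from the answer above. It assumes the locale-aware std::tolower from <locale>, and the name to_lower_any is hypothetical. It works for std::string and std::wstring; char16_t and char32_t strings are not covered, because the standard does not require ctype facets for those character types.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Lowercases any std::basic_string whose character type has a ctype facet
// in the given locale (char and wchar_t in practice).
template <typename CharT, typename Traits, typename Alloc>
std::basic_string<CharT, Traits, Alloc>
to_lower_any(std::basic_string<CharT, Traits, Alloc> s,
             const std::locale& loc = std::locale::classic()) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [&loc](CharT c) { return std::tolower(c, loc); });
    return s;
}
```

Usage would be along the lines of to_lower_any(std::string("ABC")) or to_lower_any(std::wstring(L"ABC")).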
This could be another simple version to convert uppercase to lowercase and vice versa. I used VS2017 community version to compile this source code.
Note: if there are special characters, they need to be handled using a condition check.
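One way such a two-way conversion could be written, as a sketch with a hypothetical function name swap_case. The isupper/islower checks act as the condition check mentioned in the note, so digits, punctuation and other special characters are left alone.

```cpp
#include <cctype>
#include <string>

// Returns a copy with upper-case letters lowered and lower-case letters
// raised; every other character is copied unchanged.
std::string swap_case(std::string s) {
    for (char& c : s) {
        unsigned char u = static_cast<unsigned char>(c);
        if (std::isupper(u))
            c = static_cast<char>(std::tolower(u));
        else if (std::islower(u))
            c = static_cast<char>(std::toupper(u));
    }
    return s;
}
```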