How do I tokenize a string in C++?
Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
I know you asked for a C++ solution, but you might consider this helpful:
Qt
The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.
See more at Qt documentation
Here is a sample tokenizer class that might do what you want
Example:
pystring is a small library which implements a bunch of Python's string functions, including the split method:
If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:
... and this is lazily-evaluated. You can alternatively set a vector to this range:
This will take O(m) space and O(n) time if str has n characters making up m words. See also the library's own tokenization example, here.
Adam Pierce's answer provides a hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this: Live Example
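The answer's original snippet did not survive on this page. As a rough sketch of the iterator-based idea (not the author's code), the helper below, with the hypothetical name split_with_find, uses std::find and never increments an iterator past the end of the string:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: split on a single delimiter character using
// std::find over string iterators. Empty tokens are kept.
std::vector<std::string> split_with_find(const std::string& str, char delim = ' ') {
    std::vector<std::string> tokens;
    auto start = str.cbegin();
    while (true) {
        // finish is the next delimiter or cend(); it is never moved past cend().
        const auto finish = std::find(start, str.cend(), delim);
        tokens.emplace_back(start, finish);
        if (finish == str.cend()) break;
        start = finish + 1;  // finish points at a delimiter, so this is at most cend()
    }
    return tokens;
}
```

Calling split_with_find("The quick brown fox") should yield the four words; consecutive delimiters produce empty tokens, which may or may not be what you want.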
If you're looking to abstract complexity by using standard functionality, as On Freund suggests, strtok is a simple option. If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa
Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage, though, there are several drawbacks:
1. strtok cannot be used on multiple strings at the same time: either a nullptr must be passed to continue tokenizing the current string, or a new char* to tokenize must be passed (there are some non-standard implementations which do support this, however, such as strtok_s)
2. strtok cannot be used on multiple threads simultaneously (this may, however, be implementation defined; for example, Visual Studio's implementation is thread safe)
3. strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings. To tokenize any of these with strtok, or to operate on a string whose contents need to be preserved, str would have to be copied, and then the copy could be operated on
C++20 provides us with split_view to tokenize strings in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874
The previous methods cannot generate a tokenized vector in-place, meaning that without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example, given const string str{ "The quick \tbrown \nfox" } we can do this: Live Example
The required construction of an istringstream for this option has far greater cost than the previous 2 options; however, this cost is typically hidden in the expense of string allocation.
If none of the above options are flexible enough for your tokenization needs, the most flexible option is using a regex_token_iterator. Of course, with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say, for example, we want to tokenize based on non-escaped commas, also eating white-space. Given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this: Live Example
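The original regex is missing from this page, so the following is only a sketch under stated assumptions, not the author's pattern. The hypothetical helper split_unescaped_commas matches the fields themselves rather than the delimiters. As simplifications, it skips empty fields and leaves escape sequences such as \, in the output:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns comma-separated fields, treating a backslash
// escape (such as "\,") as part of a field and trimming white-space around
// each field. Empty fields are skipped; escapes are not removed.
std::vector<std::string> split_unescaped_commas(const std::string& str) {
    // A field starts and ends with an escape or a character that is not a
    // backslash, a comma, or white-space; white-space may appear in between.
    static const std::regex field_re(
        R"((?:[^\\,\s]|\\.)(?:(?:[^\\,]|\\.)*(?:[^\\,\s]|\\.))?)");
    std::sregex_token_iterator first(str.begin(), str.end(), field_re, 0);
    std::sregex_token_iterator last;  // default-constructed: end of sequence
    return std::vector<std::string>(first, last);
}
```

For the input above this should give the fields The, qu\,ick, brown, and fox. A production version would likely also unescape the fields and decide how to treat empty ones.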
I posted this answer for a similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is the C++ String Toolkit Library.
Here is an example of how to use it that I've posted elsewhere on Stack Overflow.
Check this example. It might help you.
MFC/ATL has a very nice tokenizer. From MSDN:
If you're willing to use C, you can use the strtok function. You should pay attention to multi-threading issues when using it.
For simple stuff I just use the following:
Cowardly disclaimer: I write real-time data processing software where the data comes in through binary files, sockets, or some API call (I/O cards, cameras). I never use this function for something more complicated or time-critical than reading external configuration files on startup.
You can simply use a regular expression library and solve that using regular expressions.
Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).
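A minimal sketch of that idea using the standard <regex> library (one possible choice; the answer does not name a library). The function name extract_words is made up for illustration, and capture group 1 plays the role of \1 or $1:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every run of word characters in str.
std::vector<std::string> extract_words(const std::string& str) {
    static const std::regex word_re(R"((\w+))");
    std::vector<std::string> tokens;
    const std::sregex_iterator end;  // default-constructed: end of sequence
    for (std::sregex_iterator it(str.begin(), str.end(), word_re); it != end; ++it) {
        tokens.push_back((*it)[1].str());  // capture group 1 holds the word
    }
    return tokens;
}
```

Note that \w only matches letters, digits, and underscores, so punctuation inside a token (for example a hyphen) splits it.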
Many overly complicated suggestions here. Try this simple std::string solution:
I thought that was what the >> operator on string streams was for:
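As a small sketch of the >> approach (the helper name split_stream is an assumption, not taken from the answer), each extraction skips leading white-space and reads one word:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split on any run of white-space using operator>>.
std::vector<std::string> split_stream(const std::string& str) {
    std::istringstream iss(str);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> token) {  // stops when the stream runs out of words
        tokens.push_back(token);
    }
    return tokens;
}
```

This only works for white-space delimiters, and it never produces empty tokens.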
I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:
Please comment if there is a better approach to something in my code or if something is wrong.
UPDATE: added generic separator
Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).
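The code for this answer is missing here. One way such a switch could look is sketched below; the function name split_tokens and the keep_empty parameter are hypothetical, not the author's:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split str on any character in delims.
// keep_empty = true keeps empty tokens (strsep-like behaviour);
// keep_empty = false drops them (strtok-like behaviour).
std::vector<std::string> split_tokens(const std::string& str,
                                      const std::string& delims,
                                      bool keep_empty) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    while (true) {
        const auto end = str.find_first_of(delims, start);
        const auto len = (end == std::string::npos) ? std::string::npos : end - start;
        std::string token = str.substr(start, len);
        if (keep_empty || !token.empty()) tokens.push_back(std::move(token));
        if (end == std::string::npos) break;
        start = end + 1;  // skip the delimiter just found
    }
    return tokens;
}
```

With keep_empty set, "a,,b" gives three tokens including an empty one; without it, only "a" and "b" remain.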
Seems odd to me that with all us speed-conscious nerds here on SO no one has presented a version that uses a compile-time generated lookup table for the delimiter (example implementation further down). Using a lookup table and iterators should beat std::regex in efficiency. If you don't need to beat regex, just use it; it's standard as of C++11 and super flexible.
Some have suggested regex already, but for the noobs here is a packaged example that should do exactly what the OP expects:
If we need to be faster and accept the constraint that all chars must be 8 bits, we can make a lookup table at compile time using metaprogramming:
With that in place, making a getNextToken function is easy:
Using it is also easy:
Here is a live example: http://ideone.com/GKtkLQ
You can take advantage of boost::make_find_iterator. Something similar to this:
Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.
I wrote a simplified (and maybe slightly more efficient) version of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it helps.
I just read all the answers and couldn't find a solution that meets the following preconditions:
So here is my solution
The Boost tokenizer class can make this sort of thing quite simple:
Updated for C++11:
Here's a real simple one:
C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody disputes that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we're forced to perform (potentially redundant and costly) allocations.
Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.
At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.
A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:
Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.
Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.
More advanced splitting requires regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:
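The answer's own snippet is not preserved on this page. A minimal sketch of the regex-as-delimiter idea follows; the helper name split_regex is an assumption:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split str wherever delim matches.
std::vector<std::string> split_regex(const std::string& str, const std::regex& delim) {
    // Submatch index -1 selects the text between matches rather than the
    // matches themselves, which turns the regex into a delimiter.
    std::sregex_token_iterator first(str.begin(), str.end(), delim, -1);
    std::sregex_token_iterator last;  // default-constructed: end of sequence
    return std::vector<std::string>(first, last);
}
```

For example, split_regex(text, std::regex(R"(\s+)")) splits on runs of whitespace. If the input starts with a delimiter, the first token is empty.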
Another quick way is to use getline. Something like:
If you want, you can make a simple split() method returning a std::vector<string>, which is really useful.
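A sketch of what such a getline-based split() might look like (the name split_getline and its signature are illustrative assumptions, not the answer's code):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split on a single delimiter character by reading
// delimiter-terminated "lines" from a string stream.
std::vector<std::string> split_getline(const std::string& str, char delim) {
    std::istringstream iss(str);
    std::vector<std::string> tokens;
    std::string token;
    while (std::getline(iss, token, delim)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Unlike the >> operator, this keeps empty tokens between consecutive delimiters and works with any single-character delimiter.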
Use strtok. In my opinion, there isn't a need to build a class around tokenizing unless strtok doesn't provide you with what you need. It might not, but in 15+ years of writing various parsing code in C and C++, I've always used strtok. Here is an example
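The original example is missing from this page. As a stand-in sketch (the wrapper name split_strtok is made up), here is strtok driven from C++, working on a copy so that the caller's string is left untouched:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical wrapper around std::strtok. str is taken by value because
// strtok overwrites each delimiter it finds with '\0'.
std::vector<std::string> split_strtok(std::string str, const char* delims) {
    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&str[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {  // nullptr continues the same string
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

Because strtok keeps hidden state between calls, this wrapper should not be used from several threads at once or interleaved with another strtok loop.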
A few caveats (which might not suit your needs). The string is "destroyed" in the process, meaning that EOS characters are placed inline in the delimiter spots. Correct usage might require you to make a non-const version of the string. You can also change the list of delimiters mid-parse.
In my own opinion, the above code is far simpler and easier to use than writing a separate class for it. To me, this is one of those functions that the language provides and it does it well and cleanly. It's simply a "C based" solution. It's appropriate, it's easy, and you don't have to write a lot of extra code :-)
You can use streams, iterators, and the copy algorithm to do this fairly directly.
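A brief sketch of that combination, assuming whitespace-separated input (the helper name split_copy is hypothetical):

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read whitespace-separated words from a string
// stream and copy them into a vector.
std::vector<std::string> split_copy(const std::string& str) {
    std::istringstream iss(str);
    std::vector<std::string> tokens;
    std::copy(std::istream_iterator<std::string>(iss),  // reads one word per step
              std::istream_iterator<std::string>(),     // end-of-stream sentinel
              std::back_inserter(tokens));
    return tokens;
}
```

Swapping std::back_inserter for a std::ostream_iterator would print the tokens instead of storing them.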
A solution using regex_token_iterators:
No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.
For example (for Doug's case),
And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)
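The code for Doug's case is not preserved here. The sketch below only illustrates the design being described, where the caller passes in the vector to fill. The name split_append and the choice to append without clearing are assumptions:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split str on the (possibly multi-character) separator
// sep and append the pieces to out. out is not cleared, so the caller decides
// whether to reuse or empty it between calls.
void split_append(const std::string& str, const std::string& sep,
                  std::vector<std::string>& out) {
    if (sep.empty()) {  // an empty separator would never advance
        out.push_back(str);
        return;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = str.find(sep, start)) != std::string::npos) {
        out.push_back(str.substr(start, pos - start));
        start = pos + sep.size();
    }
    out.push_back(str.substr(start));  // remainder after the last separator
}
```

A wrapper that returns a fresh vector is a short function that creates the vector, calls split_append, and returns it.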
Reference: http://www.cplusplus.com/reference/string/string/.
(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)
Boost has a strong split function: boost::algorithm::split.
Sample program:
Output:
This is a simple STL-only solution (~5 lines!) using std::string::find and std::string::find_first_not_of that handles repetitions of the delimiter (like spaces or periods, for instance), as well as leading and trailing delimiters: Try it out live!
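The snippet itself is missing from this page. A reconstruction of the technique (a sketch, not necessarily the author's exact code, with the hypothetical name split_skip_repeats) could look like this:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split on delim, ignoring repeated, leading, and
// trailing delimiters.
std::vector<std::string> split_skip_repeats(const std::string& str, char delim = ' ') {
    std::vector<std::string> tokens;
    std::string::size_type start;
    std::string::size_type end = 0;
    // find_first_not_of skips any run of delimiters before the next token.
    while ((start = str.find_first_not_of(delim, end)) != std::string::npos) {
        end = str.find(delim, start);  // npos if this is the last token
        tokens.push_back(str.substr(start, end - start));
    }
    return tokens;
}
```

When end is npos, substr simply takes the rest of the string and the next find_first_not_of call returns npos, which ends the loop.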