Split a string into words by multiple delimiters

Posted 2024-12-07 20:32:18 | 264 words | 0 views | 0 comments

I have some text (meaningful text or an arithmetic expression) and I want to split it into words.
If I had a single delimiter, I'd use:

std::stringstream stringStream(inputString);
std::string word;
while(std::getline(stringStream, word, delimiter)) 
{
    wordVector.push_back(word);
}

How can I break the string into tokens with several delimiters?

Comments (7)

时常饿 2024-12-14 20:32:18

Assuming one of the delimiters is newline, the following reads the line and further splits it by the delimiters. For this example I've chosen the delimiters space, apostrophe, and semi-colon.

std::stringstream stringStream(inputString);
std::string line;
while(std::getline(stringStream, line)) 
{
    std::size_t prev = 0, pos;
    while ((pos = line.find_first_of(" ';", prev)) != std::string::npos)
    {
        if (pos > prev)
            wordVector.push_back(line.substr(prev, pos-prev));
        prev = pos+1;
    }
    if (prev < line.length())
        wordVector.push_back(line.substr(prev, std::string::npos));
}
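
The loop above can also be wrapped in a reusable function. The following is only a minimal sketch: the name splitAny and its signature are assumptions made for illustration and are not part of the original answer. It works on a single string rather than a stream, so a newline counts as a delimiter only if it is listed in delims.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `text` on any character found in `delims`,
// skipping empty tokens (same behaviour as the loop above).
std::vector<std::string> splitAny(const std::string& text, const std::string& delims)
{
    std::vector<std::string> words;
    std::size_t prev = 0, pos;
    while ((pos = text.find_first_of(delims, prev)) != std::string::npos)
    {
        if (pos > prev)
            words.push_back(text.substr(prev, pos - prev));
        prev = pos + 1;
    }
    // Whatever follows the last delimiter is the final token
    if (prev < text.length())
        words.push_back(text.substr(prev));
    return words;
}
```

For example, splitAny("it's a;test", " ';") should give the four tokens it, s, a and test.
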
止于盛夏 2024-12-14 20:32:18

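If you have Boost, you could use:

#include <boost/algorithm/string.hpp>
#include <string>
#include <vector>

std::string inputString("One!Two,Three:Four");
std::string delimiters("|,:");
std::vector<std::string> parts;
// Splits on any single character in `delimiters`.
// '!' is not in the set, so the result is {"One!Two", "Three", "Four"}.
boost::split(parts, inputString, boost::is_any_of(delimiters));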
歌枕肩 2024-12-14 20:32:18

Using std::regex

A std::regex can do string splitting in a few lines:

std::regex re("[\\|,:]");
// the -1 is what makes the regex split (-1 := the text that was not matched)
std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
std::vector<std::string> tokens{first, last};
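
One thing to be aware of with the -1 submatch: adjacent delimiters or a leading delimiter produce empty tokens. A possible wrapper that filters them out is sketched below. regexSplit is a hypothetical name, and the character class in the usage note is simply the one from the snippet above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on the regex `re`, dropping empty tokens.
std::vector<std::string> regexSplit(const std::string& input, const std::regex& re)
{
    std::vector<std::string> tokens;
    // -1 selects the text between matches instead of the matches themselves
    std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;
    for (; first != last; ++first)
    {
        if (first->length() > 0)
            tokens.push_back(first->str());
    }
    return tokens;
}
```

With std::regex re("[\\|,:]"), calling regexSplit("One|Two,,Three:Four", re) should give One, Two, Three and Four, without an empty entry for the doubled comma.
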

固执像三岁 2024-12-14 20:32:18

I don't know why nobody pointed out the manual way, but here it is:

const std::string delims(";,:. \n\t");
inline bool isDelim(char c) {
    // Linear scan over the delimiter set
    return delims.find(c) != std::string::npos;
}

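and in the function:

std::stringstream stringStream(inputString);
std::string word;
int c;  // int rather than char, so EOF stays distinguishable from real characters

while (stringStream) {
    word.clear();

    // Read word: stop at a delimiter or at end of input
    while ((c = stringStream.get()) != EOF && !isDelim(static_cast<char>(c)))
        word.push_back(static_cast<char>(c));
    if (c != EOF)
        stringStream.unget();

    // Skip the empty word produced by leading delimiters or empty input
    if (!word.empty())
        wordVector.push_back(word);

    // Read delims
    while ((c = stringStream.get()) != EOF && isDelim(static_cast<char>(c)));
    if (c != EOF)
        stringStream.unget();
}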

This way you can do something useful with the delims if you want.
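
As one possible way to do something useful with the delims, the sketch below keeps operator characters as tokens of their own, which suits the arithmetic expressions mentioned in the question. The function name tokenizeExpression and the operator set are assumptions made for this illustration only.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: whitespace separates tokens and is dropped;
// operator characters separate tokens and are kept as one-character tokens.
std::vector<std::string> tokenizeExpression(const std::string& text)
{
    const std::string operators = "+-*/()";
    const std::string whitespace = " \t\n";
    std::vector<std::string> tokens;
    std::string word;

    for (char c : text)
    {
        const bool isOp = operators.find(c) != std::string::npos;
        const bool isSpace = whitespace.find(c) != std::string::npos;
        if (!isOp && !isSpace)
        {
            word.push_back(c);   // still inside a word
            continue;
        }
        if (!word.empty())       // any delimiter ends the current word
        {
            tokens.push_back(word);
            word.clear();
        }
        if (isOp)                // operators are kept, whitespace is not
            tokens.push_back(std::string(1, c));
    }
    if (!word.empty())
        tokens.push_back(word);
    return tokens;
}
```

For the input "3 + 41*(x-2)" this should produce the tokens 3, +, 41, *, (, x, -, 2 and ).
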

似狗非友 2024-12-14 20:32:18

If you are interested in how to do it yourself without using Boost:

Suppose the delimiter string has length M and may be long. Checking whether each character of the input is a delimiter by scanning the delimiter string costs O(M) per character, so doing this for an input of length N costs O(M*N).

I would use a dictionary instead (conceptually a map from delimiter to boolean, but here a simple boolean array that holds true at the index equal to each delimiter's character code).

Checking whether a character is a delimiter is then O(1) while iterating over the string, which gives O(N) overall (plus O(M) to fill the table).

Here is my sample code:

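#include <iostream>
#include <string>
#include <vector>

using namespace std;

const int dictSize = 256;

vector<string> tokenizeMyString(const string &s, const string &del)
{
    // Lookup table: dict[c] is true when character c is a delimiter.
    // Not static, so delimiters from an earlier call do not leak into this one.
    bool dict[dictSize] = { false };

    vector<string> res;
    for (size_t i = 0; i < del.size(); ++i) {
        // Cast to unsigned char so characters above 127 do not give a negative index
        dict[static_cast<unsigned char>(del[i])] = true;
    }

    string token;
    for (char c : s) {
        if (dict[static_cast<unsigned char>(c)]) {
            // A delimiter ends the current token; empty tokens are skipped
            if (!token.empty()) {
                res.push_back(token);
                token.clear();
            }
        }
        else {
            token += c;
        }
    }
    if (!token.empty()) {
        res.push_back(token);
    }
    return res;
}


int main()
{
    string delString = "MyDog:Odie, MyCat:Garfield  MyNumber:1001001";
    // the delimiters are " " (space) and "," (comma)
    vector<string> res = tokenizeMyString(delString, " ,");

    for (auto &i : res) {
        cout << "token: " << i << endl;
    }
    return 0;
}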

Note: tokenizeMyString returns the vector by value. Because res is a local variable, the compiler can apply named return value optimization (NRVO) or, failing that, move it, so the vector is not deep-copied on return.

输什么也不输骨气 2024-12-14 20:32:18

Using Eric Niebler's range-v3 library:

https://godbolt.org/z/ZnxfSa

#include <string>
#include <iostream>
#include "range/v3/all.hpp"

int main()
{
    std::string s = "user1:192.168.0.1|user2:192.168.0.2|user3:192.168.0.3";
    auto words = s  
        | ranges::view::split('|')
        | ranges::view::transform([](auto w){
            return w | ranges::view::split(':');
        });
    ranges::for_each(words, [](auto i){ std::cout << i << "\n"; });
}
抚笙 2024-12-14 20:32:18

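And here, ages later, a solution using C++20. Note that std::views::split treats the delimiter as one multi-character pattern ("-_-" below), not as a set of alternative characters:

constexpr std::string_view words{"Hello-_-C++-_-20-_-!"};
constexpr std::string_view delimiters{"-_-"};
for (const auto word : std::views::split(words, delimiters)) {
    // each word is a subrange; build a string_view from its iterators
    std::cout << std::quoted(std::string_view(word.begin(), word.end())) << ' ';
}
// outputs: "Hello" "C++" "20" "!"

Required headers:

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>

Reference: https://en.cppreference.com/w/cpp/ranges/split_view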
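
Because std::views::split matches the whole pattern, splitting on any one of several characters needs a different approach. A minimal sketch using std::string_view (C++17 or later) follows. splitAnyOf is a hypothetical name, and the returned views refer to the input buffer, which must outlive them.

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper: split on any character in `delims` without copying.
// The returned views point into `text`; keep the underlying buffer alive.
std::vector<std::string_view> splitAnyOf(std::string_view text, std::string_view delims)
{
    std::vector<std::string_view> parts;
    while (!text.empty())
    {
        const std::size_t pos = text.find_first_of(delims);
        if (pos == std::string_view::npos)
        {
            parts.push_back(text);    // no delimiter left: last token
            break;
        }
        if (pos > 0)                  // skip empty tokens
            parts.push_back(text.substr(0, pos));
        text.remove_prefix(pos + 1);  // continue after the delimiter
    }
    return parts;
}
```

For example, splitAnyOf("Hello-C++_20 !", "-_ ") should give Hello, C++, 20 and !.
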
