Using boost::tokenizer with string delimiters
I've been looking at boost::tokenizer, and I've found that the documentation is very thin. Is it possible to make it tokenize a string such as "dolphin--monkey--baboon" so that every word is a token, and every double dash is a token as well? From the examples I've only seen single-character delimiters being allowed. Is the library not advanced enough for more complicated delimiters?
4 Answers
I know this thread is quite old, but it shows up in the top Google results when searching for "boost tokenizer by string", so I will add my variant of a TokenizerFunction, just in case. After that, we can create and use it like a usual boost tokenizer.
Using iter_split allows you to split on multiple-character delimiters. The code below would produce the following:
dolphin
mon-key
baboon
One option is to try boost::regex. Not sure of the performance compared to a custom tokenizer.
It looks like you will need to write your own TokenizerFunction to do what you want.