Using boost::tokenizer with string delimiters
I've been looking at boost::tokenizer, and I've found that the documentation is very thin. Is it possible to make it tokenize a string such as "dolphin--monkey--baboon" so that every word is a token, and every double dash is a token as well? From the examples I've only seen single-character delimiters being allowed. Is the library not advanced enough for more complicated delimiters?
4 Answers
I know this thread is quite old, but it shows up in the top Google results when searching for "boost tokenizer by string", so I will add my variant of a TokenizerFunction, just in case. After that, we can create and use it like a usual boost tokenizer.
Using iter_split allows you to split on multiple-character delimiters. The code below would produce the following:
dolphin
mon-key
baboon
One option is to try boost::regex. Not sure of the performance compared to a custom tokenizer.
It looks like you will need to write your own TokenizerFunction to do what you want.