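Boost.Tokenizer for quotes and parentheses
I want to use Boost.Tokenizer to split a string into tokens. Text inside quotes or parentheses must be kept together as a single token. More specifically, I need to split a line like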
"one (two),three" four (five "six".seven ) eight(nine, ten)
into the tokens
one (two),three
four
(five "six".seven )
eight
(nine, ten)
or, perhaps,
one (two),three
four
(
five "six".seven
)
eight
(
nine, ten
)
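I know how to tokenize text in quotes, but I don't know how to tokenize text in parentheses at the same time. Perhaps a TokenizerFunction needs to be implemented.
How can I split the string as I described?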
Comments (1)
TokenizerFunction is a functor that has two methods, neither of which should be very difficult to implement. The first is reset, which is meant to reset any state the functor might have, and the other is operator(), which takes three parameters. The first two are iterators (the current position, passed by reference so it can be advanced, and the end of the input), and the third receives the resulting token.

The algorithm is simple. First, we skip any spaces. We expect the first non-space character to be one of three kinds. If it's a quotation mark or left parenthesis, then we search until we find the corresponding closing delimiter and return what we find as the token, taking care that quotation marks are supposed to be stripped, but parentheses, apparently, are to remain. If the first character is something else, then we search to the next delimiter and return that instead.
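The steps just described can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation, and several details are assumptions: the default template arguments, the choice of whitespace, a quotation mark, and a left parenthesis as the characters that end a plain token, and the decision that an unterminated quote or parenthesis swallows the rest of the input. The helper split_tokens is hypothetical; it only drives the functor by hand so the sketch can be tried without Boost.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Sketch of a TokenizerFunction model: reset() plus a three-argument operator().
template <typename Iterator = std::string::const_iterator,
          typename Token = std::string>
class QuoteParenTokenizer {
public:
    // No state is kept between calls, so there is nothing to reset.
    void reset() {}

    // Extracts one token from [next, end) and advances next past it.
    // Returns false once only whitespace (or nothing) is left.
    bool operator()(Iterator& next, Iterator end, Token& tok) {
        // Skip leading whitespace.
        while (next != end && is_space(*next)) ++next;
        if (next == end) return false;

        if (*next == '"') {
            // Quoted token: both quotation marks are stripped.
            Iterator first = ++next;
            Iterator close = std::find(first, end, '"');
            tok.assign(first, close);
            next = close;
            if (next != end) ++next;  // step over the closing quote
        } else if (*next == '(') {
            // Parenthesized token: the parentheses are kept.
            Iterator first = next;
            Iterator close = std::find(first, end, ')');
            if (close != end) ++close;  // include the ')'
            tok.assign(first, close);
            next = close;
        } else {
            // Plain token: runs up to the next delimiter (assumed set below).
            Iterator first = next;
            next = std::find_if(first, end,
                                [](char c) { return is_delimiter(c); });
            tok.assign(first, next);
        }
        return true;
    }

private:
    static bool is_space(char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    }
    // Assumption: a plain token ends at whitespace, '"' or '('.
    static bool is_delimiter(char c) {
        return is_space(c) || c == '"' || c == '(';
    }
};

// Hypothetical helper for illustration: calls the functor in a loop, the way
// a tokenizer would, and collects the tokens.
inline std::vector<std::string> split_tokens(const std::string& line) {
    QuoteParenTokenizer<> f;
    f.reset();
    std::vector<std::string> out;
    std::string::const_iterator it = line.begin();
    std::string tok;
    while (f(it, line.end(), tok)) out.push_back(tok);
    return out;
}
```

For the sample line in the question, split_tokens yields the five tokens of the first listing: one (two),three, then four, then (five "six".seven ), then eight, then (nine, ten).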
You'd instantiate it as tokenizer<QuoteParenTokenizer<> >. If you have a different iterator type, or a different token type, you'll need to indicate them in the template parameters to both tokenizer and QuoteParenTokenizer.

You can get fancier if you need to handle escaped delimiter characters. Things will be trickier if you need parenthesized expressions to nest.
Beware that as of right now, the above code has not been tested.
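On the nesting point, one common approach is to count depth while scanning for the closing parenthesis. The sketch below is only an illustration of that idea: find_matching_paren is a hypothetical helper, and it assumes that quotation marks inside the parentheses get no special treatment, so a parenthesis inside a quoted string still counts.

```cpp
#include <string>

// Hypothetical helper: given an iterator at '(', returns the iterator one past
// the matching ')', or end if the parentheses never balance.
template <typename Iterator>
Iterator find_matching_paren(Iterator open, Iterator end) {
    int depth = 0;
    for (Iterator it = open; it != end; ++it) {
        if (*it == '(') {
            ++depth;                       // entering a nested group
        } else if (*it == ')') {
            if (--depth == 0) return ++it; // outermost group just closed
        }
    }
    return end;  // unbalanced input
}
```

A tokenizer functor could call this in place of a plain search for the first ')' when it meets a left parenthesis.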