Boost.Tokenizer for quotes and parentheses

Posted on 2025-01-07 13:59:42

I'd like to split a string into tokens using Boost.Tokenizer. Text inside quotes or parentheses must be kept as a single whole token. More specifically, I need to split a line like

"one (two),three" four (five "six".seven ) eight(nine, ten)

into tokens like

one (two),three
four
(five "six".seven )
eight
(nine, ten)

or maybe

one (two),three
four
(
five "six".seven
)
eight
(
nine, ten
)

I know how to tokenize text in quotation marks, but I have no idea how to tokenize text in parentheses at the same time. Maybe I need to implement a TokenizerFunction.
How can I split a string as described?

Answer by 我一直都在从未离去, 2025-01-14 13:59:42

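TokenizerFunction is a functor that has two methods, neither of which should be very difficult to implement. The first is reset, which is meant to reset any state the functor might have, and the other is operator(), which takes three parameters. The first two are iterators (the current position, passed by reference so the functor can advance it, and the end of the input), and the third receives the resulting token. It returns true when a token was produced and false when there is nothing more to read.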

The algorithm below is simple. First, we skip any spaces. We expect the first non-space character to be one of three kinds. If it's a quotation mark or left parenthesis, then we search until we find the corresponding closing delimiter and return what we find as the token, taking care that quotation marks are supposed to be stripped, but parentheses, apparently, are to remain. If the first character is something else, then we search to the next delimiter and return that instead.

#include <algorithm> // std::find
#include <string>

template <
  typename Iter = std::string::const_iterator,
  typename Type = std::string
  >
struct QuoteParenTokenizer
{
  void reset() { } // stateless, so nothing to reset

  // Reads one token starting at `next` and leaves `next` just past it.
  bool operator()(Iter& next, Iter end, Type& tok) const
  {
    // skip leading spaces
    while (next != end && *next == ' ')
      ++next;
    if (next == end)
      return false; // nothing left to read

    switch (*next) {
      case '"': {
        ++next; // skip token start
        Iter const quote = std::find(next, end, '"');
        if (quote == end)
          return false; // unterminated token
        tok.assign(next, quote); // quotes are stripped
        next = quote;
        ++next; // skip the closing quote
        break;
      }
      case '(': {
        Iter paren = std::find(next, end, ')');
        if (paren == end)
          return false; // unterminated token
        ++paren; // include the parenthesis
        tok.assign(next, paren);
        next = paren;
        break;
      }
      default: {
        // plain word: read up to the next space, quote, or left parenthesis
        Iter const first = next;
        while (next != end && *next != ' ' && *next != '"' && *next != '(')
          ++next;
        tok.assign(first, next);
      }
    }
    return true;
  }
};

You'd instantiate it as tokenizer<QuoteParenTokenizer<> >. If you have a different iterator type, or a different token type, you'll need to indicate them in the template parameters to both tokenizer and QuoteParenTokenizer.
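To see what the functor produces without pulling in Boost, here is a minimal sketch that drives it by hand. The helper name collect_tokens is made up for this example and is not part of Boost; the loop only imitates, as far as I understand it, the way a tokenizer calls the functor repeatedly until it returns false. The struct is repeated so the block compiles on its own.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Copy of the functor from the answer, repeated so this block stands alone.
template <typename Iter = std::string::const_iterator,
          typename Type = std::string>
struct QuoteParenTokenizer
{
  void reset() { }

  bool operator()(Iter& next, Iter end, Type& tok) const
  {
    while (next != end && *next == ' ')
      ++next;
    if (next == end)
      return false; // nothing left to read

    switch (*next) {
      case '"': {
        ++next; // skip the opening quote
        Iter const quote = std::find(next, end, '"');
        if (quote == end)
          return false; // unterminated token
        tok.assign(next, quote);
        next = quote;
        ++next; // skip the closing quote
        break;
      }
      case '(': {
        Iter paren = std::find(next, end, ')');
        if (paren == end)
          return false; // unterminated token
        ++paren; // include the parenthesis
        tok.assign(next, paren);
        next = paren;
        break;
      }
      default: {
        Iter const first = next;
        while (next != end && *next != ' ' && *next != '"' && *next != '(')
          ++next;
        tok.assign(first, next);
      }
    }
    return true;
  }
};

// Hypothetical helper (not a Boost API): calls the functor until it
// reports that no further token could be read, collecting the results.
inline std::vector<std::string> collect_tokens(const std::string& line)
{
  QuoteParenTokenizer<> func;
  func.reset();
  std::vector<std::string> tokens;
  std::string::const_iterator next = line.begin();
  std::string tok;
  while (func(next, line.end(), tok))
    tokens.push_back(tok);
  return tokens;
}
```

Called on the line from the question, collect_tokens should yield the five tokens of the first desired output. With Boost available, the equivalent would presumably be boost::tokenizer<QuoteParenTokenizer<> > tok(line); followed by a loop from tok.begin() to tok.end().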

You can get fancier if you need to handle escaped delimiter characters. Things will be trickier if you need parenthesized expressions to nest.
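As a rough sketch of those two extensions, the helpers below could replace the std::find calls in the quote and parenthesis cases. Both names (find_matching_paren, read_quoted) are invented for illustration, and the choice of backslash as the escape character is an assumption, since the question does not specify one. The parenthesis helper counts depth and deliberately ignores quotes, so a parenthesis inside a quoted string within parentheses would still be counted.

```cpp
#include <string>

// Hypothetical helper. Precondition: *open == '('.
// Returns an iterator to the ')' that balances `open`, or `end` if the
// parentheses never balance. Quotes are not given special treatment.
template <typename Iter>
Iter find_matching_paren(Iter open, Iter end)
{
  int depth = 0;
  for (Iter it = open; it != end; ++it) {
    if (*it == '(') {
      ++depth;
    } else if (*it == ')') {
      if (--depth == 0)
        return it;
    }
  }
  return end;
}

// Hypothetical helper. `next` points just past an opening '"'.
// Copies characters into `tok` up to the closing '"'; a backslash means
// "take the following character literally" (assumed escape convention).
// On success, `next` is left just past the closing quote.
template <typename Iter>
bool read_quoted(Iter& next, Iter end, std::string& tok)
{
  tok.clear();
  for (Iter it = next; it != end; ++it) {
    if (*it == '\\') {
      if (++it == end)
        return false; // dangling backslash
      tok += *it;
    } else if (*it == '"') {
      next = ++it;
      return true;
    } else {
      tok += *it;
    }
  }
  return false; // unterminated quote
}
```

In the '(' case, the result of find_matching_paren would be compared against end and then incremented to include the closing parenthesis, just as the std::find result is in the original code.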

Beware that as of right now, the above code has not been tested.
