Reverse offset tokenizer

Posted 2024-07-08 12:09:15

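I have a string to tokenize. It has the form HHmmssff, where H, m, s and f are digits.

It should be tokenized into four 2-digit numbers, but I also need it to accept shorthand forms such as sff, which it should interpret as 00000sff. I wanted to use boost::tokenizer's offset_separator, but it seems to work only with positive offsets, and I'd like it to work backwards.

OK, one idea is to pad the string with zeros on the left, but maybe the community will come up with something uber-smart. ;)

Edit: Additional requirements have just come into play.

The basic need for a smarter solution is to handle all the cases, such as f, ssff, mssff and so on, but also to accept a more complete time notation such as HH:mm:ss:ff together with its shorthand forms, for example s:ff or even s: (this one should be interpreted as s:00).

In the case where the string ends with :, I can obviously pad it with two zeros as well, then strip out all the separators, leaving just the digits, and parse the resulting string with Spirit.

But it seems it would be a bit simpler if there were a way to make the offset tokenizer go back from the end of the string (offsets -2, -4, -6, -8) and convert the numbers to int.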
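
For reference, the padding idea described above could look roughly like the following. This is only a sketch: the helper name normalize_timecode is made up for illustration, and it assumes that only the leftmost field is ever abbreviated (so "s:ff" is fine, but something like "1:2:03" would be misread once the separators are gone).

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns "sff", "s:ff", "s:" and similar into "HHmmssff".
// Assumption: only the leftmost field is abbreviated.
std::string normalize_timecode(std::string s) {
    // A trailing ':' stands for an omitted ff field, so append "00".
    if (!s.empty() && s.back() == ':') s += "00";
    // Drop every ':' separator, keeping the other characters in order.
    s.erase(std::remove(s.begin(), s.end(), ':'), s.end());
    if (s.size() > 8) throw std::invalid_argument("more than 8 digits");
    for (char c : s)
        if (c < '0' || c > '9') throw std::invalid_argument("non-digit character");
    // Left-pad with zeros up to the full eight digits.
    return std::string(8 - s.size(), '0') + s;
}
```

The result always has exactly eight digits, so it can then be split at fixed positive offsets (2, 2, 2, 2), at the price of one string copy.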


花心好男孩 2024-07-15 12:09:15

I keep preaching BNF notation. If you can write down the grammar that defines your problem, you can easily convert it into a Boost.Spirit parser, which will do it for you.

TimeString := LongNotation | ShortNotation

LongNotation := Hours Minutes Seconds Fraction

Hours := digit digit
Minutes := digit digit
Seconds := digit digit
Fraction := digit digit

ShortNotation := ShortSeconds Fraction
ShortSeconds := digit

Edit: additional constraint

VerboseNotation := [ [ [ Hours ':' ] Minutes ':' ] Seconds ':' ] Fraction
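
As a rough illustration of how the VerboseNotation rule could be turned into code, here is a hand-written sketch in plain C++ rather than an actual Boost.Spirit parser. The function name parse_verbose and the layout of the output array are made up for this example. It also assumes, beyond the grammar as written, that each field may be one or two digits and that an empty Fraction (as in "s:") counts as 00, which is what the question asks for.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical sketch of [ [ [ Hours ':' ] Minutes ':' ] Seconds ':' ] Fraction.
// out[0] = hours, out[1] = minutes, out[2] = seconds, out[3] = fraction.
// Assumptions: fields are one or two digits; an empty Fraction means 0.
bool parse_verbose(const std::string &s, std::array<int, 4> &out) {
    out.fill(0);
    int slot = 3;                      // fill from Fraction backwards
    int value = 0, digits = 0, scale = 1;
    for (std::size_t i = s.size(); i-- > 0;) {   // walk right to left
        const char c = s[i];
        if (c >= '0' && c <= '9') {
            if (++digits > 2) return false;      // field too long
            value += (c - '0') * scale;
            scale *= 10;
        } else if (c == ':') {
            // Only the trailing Fraction field may be empty.
            if (digits == 0 && slot != 3) return false;
            if (slot == 0) return false;         // more than four fields
            out[slot--] = value;
            value = 0; digits = 0; scale = 1;
        } else {
            return false;                        // unexpected character
        }
    }
    if (digits == 0) return false;     // leftmost field needs a digit
    out[slot] = value;
    return true;
}
```

Because the optional parts of the rule are all on the left, reading the input from the right means the first field seen is always Fraction, the next Seconds, and so on.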


深海少女心 2024-07-15 12:09:15

In response to the comment "Don't mean to be a performance freak, but this solution involves some string copying (input is a const & std::string)".

If you really care about performance so much that you can't use a big old library like regex, won't risk a BNF parser, don't want to assume that std::string::substr will avoid a copy with allocation (and hence can't use STL string functions), and can't even copy the string chars into a buffer and left-pad with '0' characters:

#include <stdexcept>
#include <string>

void parse(const std::string &s) {
    std::string::const_iterator current = s.begin();
    int HH = 0;
    int mm = 0;
    int ss = 0;
    int ff = 0;
    // The length picks the entry point; each case deliberately falls
    // through to the next, so the remaining digits are read in order.
    switch (s.size()) {
        case 8:
            HH = (*(current++) - '0') * 10;
        case 7:
            HH += (*(current++) - '0');
        case 6:
            mm = (*(current++) - '0') * 10;
        // ... you get the idea (cases 5 down to 2 follow the same pattern).
        case 1:
            ff += (*current - '0');
        case 0: break;
        default: throw std::logic_error("invalid time");
        // except that this code goes so badly wrong if the input isn't
        // valid that there's not much point objecting to the length...
    }
}

But fundamentally, just 0-initialising those int variables is almost as much work as copying the string into a char buffer with padding, so I wouldn't expect to see any significant performance difference. I therefore don't actually recommend this solution in real life, just as an exercise in premature optimisation.
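
For completeness, here is one way the elided cases might be written out in full. Treat it as a sketch: the TimeCode struct and the name parse_timecode are invented for this example, and the up-front digit check is an addition that the snippet above does not have.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical result type for the parsed fields.
struct TimeCode { int HH, mm, ss, ff; };

TimeCode parse_timecode(const std::string &s) {
    // Added check: reject non-digits so the arithmetic below is meaningful.
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (s[i] < '0' || s[i] > '9')
            throw std::invalid_argument("non-digit in time string");

    TimeCode t = {0, 0, 0, 0};
    std::string::const_iterator current = s.begin();
    // The length selects the entry point; every case then falls through
    // to the next, so the digits are consumed left to right.
    switch (s.size()) {
        case 8: t.HH  = (*current++ - '0') * 10;  // fall through
        case 7: t.HH += (*current++ - '0');       // fall through
        case 6: t.mm  = (*current++ - '0') * 10;  // fall through
        case 5: t.mm += (*current++ - '0');       // fall through
        case 4: t.ss  = (*current++ - '0') * 10;  // fall through
        case 3: t.ss += (*current++ - '0');       // fall through
        case 2: t.ff  = (*current++ - '0') * 10;  // fall through
        case 1: t.ff += (*current - '0');         // fall through
        case 0: break;
        default: throw std::invalid_argument("more than 8 digits");
    }
    return t;
}
```

With this, "123" comes out as ss = 1 and ff = 23, and the input string is never copied.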


太傻旳人生 2024-07-15 12:09:15

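Regular expressions come to mind. Something like "^0*?(\\d?\\d?)(\\d?\\d?)(\\d?\\d?)(\\d?\\d?)$" with boost::regex. The submatches will give you the digit values. It shouldn't be hard to adapt it to your other format with colons between the numbers (see sep61.myopenid.com's answer). boost::regex is one of the fastest regex parsers out there.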
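
A minimal sketch of the regex approach follows, with two deliberate deviations that should be read as assumptions rather than as the pattern above. It uses std::regex instead of boost::regex so that it is self-contained, and it uses lazy {0,2}? groups: with greedy \d?\d? groups a backtracking engine lets the leftmost group take digits first, so "123" would split as "12" and "3" instead of "1" and "23". The helper name parse_with_regex is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: fills out[0..3] with HH, mm, ss, ff.
// Returns false when the input is not 1 to 8 digits.
bool parse_with_regex(const std::string &s, int (&out)[4]) {
    // Lazy {0,2}? groups take as few digits as they can, so the digits
    // that are present end up in the rightmost groups; ff needs 1 or 2.
    static const std::regex re(
        "^(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2}?)(\\d{1,2})$");
    std::smatch m;
    if (!std::regex_match(s, m, re)) return false;
    for (int i = 0; i < 4; ++i) {
        const std::string digits = m[i + 1].str();
        out[i] = digits.empty() ? 0 : std::stoi(digits);  // empty group = 0
    }
    return true;
}
```

Extending the pattern with optional ':' separators between the groups would be the natural next step for the colon notation, though that is not shown here.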
