Reverse offset tokenizer

Posted on 2024-07-08 12:09:15

I have a string to tokenize. Its form is HHmmssff, where H, m, s, f are digits.

It's supposed to be tokenized into four 2-digit numbers, but I need it to also accept shorthand forms such as sff, which should be interpreted as 00000sff.
I wanted to use boost::tokenizer's offset_separator, but it seems to work only with positive offsets, and I'd like it to work backwards, from the end of the string.

Ok, one idea is to pad the string with zeroes from the left, but maybe the community comes up with something uber-smart. ;)

Edit: Additional requirements have just come into play.

The basic need for a smarter solution was to handle all cases, like f, ssff, mssff, etc. but also accept a more complete time notation, like HH:mm:ss:ff with its short-hand forms, e.g. s:ff or even s: (this one's supposed to be interpreted as s:00).

In the case where the string ends with : I can obviously pad it with two zeroes as well, then strip out all separators leaving just the digits and parse the resulting string with spirit.

But it seems like it would be a bit simpler if there were a way to make the offset tokenizer go backwards from the end of the string (offsets -2, -4, -6, -8) and lexically cast the tokens to ints.
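
The backwards walk described above can be sketched without boost::tokenizer at all. This is only a minimal illustration under stated assumptions: parse_from_end is a hypothetical helper name, the input is assumed to contain digits only (no separators), and the result is returned as {HH, mm, ss, ff}.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): splits a digits-only timecode
// into {HH, mm, ss, ff}, consuming up to two characters at a time from
// the END of the string, i.e. at offsets -2, -4, -6, -8.
std::array<int, 4> parse_from_end(const std::string& s) {
    if (s.size() > 8) throw std::invalid_argument("too many digits");
    std::array<int, 4> fields = {0, 0, 0, 0};
    std::size_t end = s.size();  // one past the last character of the chunk
    for (int i = 3; i >= 0 && end > 0; --i) {
        // The leftmost chunk may be a single digit (e.g. the "s" in "sff").
        std::size_t begin = end >= 2 ? end - 2 : 0;
        int value = 0;
        for (std::size_t p = begin; p < end; ++p) {
            if (!std::isdigit(static_cast<unsigned char>(s[p])))
                throw std::invalid_argument("non-digit character");
            value = value * 10 + (s[p] - '0');
        }
        fields[i] = value;
        end = begin;  // move one chunk to the left
    }
    return fields;
}
```

Under these assumptions, parse_from_end("512") yields {0, 0, 5, 12} and parse_from_end("12345") yields {0, 1, 23, 45}; no copy of the input string is made.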

Comments (3)

花心好男孩 2024-07-15 12:09:15

I keep preaching BNF notation. If you can write down the grammar that defines your problem, you can easily convert it into a Boost.Spirit parser, which will do it for you.

TimeString := LongNotation | ShortNotation

LongNotation := Hours Minutes Seconds Fraction

Hours := digit digit
Minutes := digit digit
Seconds := digit digit
Fraction := digit digit

ShortNotation := ShortSeconds Fraction
ShortSeconds := digit

Edit: additional constraint

VerboseNotation := [ [ [ Hours ':' ] Minutes ':' ] Seconds ':' ]  Fraction
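
As a rough illustration of the VerboseNotation rule, here is a hand-written stand-in in plain C++. It is not a Boost.Spirit parser, and parse_verbose is a hypothetical name. It assumes one or two digits per field and, following the question's "s:" case, treats an empty field as 0, which is more permissive than the grammar as written.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parses [[[H:]m:]s:]f into {HH, mm, ss, ff}.
// Assumption: each field has at most 2 digits; an empty field counts as 0.
std::array<int, 4> parse_verbose(const std::string& s) {
    std::vector<int> parts;
    int value = 0;
    int digits = 0;
    for (std::size_t i = 0; i <= s.size(); ++i) {
        if (i == s.size() || s[i] == ':') {
            parts.push_back(value);  // close the current field
            value = 0;
            digits = 0;
        } else if (std::isdigit(static_cast<unsigned char>(s[i])) && digits < 2) {
            value = value * 10 + (s[i] - '0');
            ++digits;
        } else {
            throw std::invalid_argument("bad character or field too long");
        }
    }
    if (parts.size() > 4) throw std::invalid_argument("too many fields");
    // Right-align the fields: the last one is always the fraction.
    std::array<int, 4> fields = {0, 0, 0, 0};
    std::size_t offset = 4 - parts.size();
    for (std::size_t i = 0; i < parts.size(); ++i) fields[offset + i] = parts[i];
    return fields;
}
```

With these assumptions, "5:12" gives {0, 0, 5, 12} and "5:" gives {0, 0, 5, 0}. The compact form without colons (such as sff) is rejected here and would need its own rule.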

深海少女心 2024-07-15 12:09:15

In response to the comment "Don't mean to be a performance freak, but this solution involves some string copying (input is a const & std::string)".

If you really care about performance so much that you can't use a big old library like regex, won't risk a BNF parser, don't want to assume that std::string::substr will avoid a copy with allocation (and hence can't use STL string functions), and can't even copy the string chars into a buffer and left-pad with '0' characters:

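#include <stdexcept>
#include <string>

using namespace std;

void parse(const string &s) {
    string::const_iterator current = s.begin();
    int HH = 0;
    int mm = 0;
    int ss = 0;
    int ff = 0;
    // Every case deliberately falls through to the next one, so a shorter
    // input simply enters the chain of digit reads further down.
    switch(s.size()) {
        case 8:
            HH = (*(current++) - '0') * 10;
        case 7:
            HH += (*(current++) - '0');
        case 6:
            mm = (*(current++) - '0') * 10;
        case 5:
            mm += (*(current++) - '0');
        case 4:
            ss = (*(current++) - '0') * 10;
        case 3:
            ss += (*(current++) - '0');
        case 2:
            ff = (*(current++) - '0') * 10;
        case 1:
            ff += (*current - '0');
        case 0: break;
        default: throw logic_error("invalid date");
        // except that this code goes so badly wrong if the input isn't
        // valid that there's not much point objecting to the length...
    }
    // HH, mm, ss and ff now hold the parsed fields.
}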

But fundamentally, just 0-initialising those int variables is almost as much work as copying the string into a char buffer with padding, so I wouldn't expect to see any significant performance difference. I therefore don't actually recommend this solution in real life, just as an exercise in premature optimisation.
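
For comparison, the padding approach mentioned above might look roughly like this. It is a sketch only: parse_padded is a hypothetical name, the input is assumed to be digits only, and the one small string copy is exactly the cost being discussed.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: left-pads the input with '0' to 8 characters, then
// reads four fixed 2-digit fields {HH, mm, ss, ff} at positive offsets.
std::array<int, 4> parse_padded(const std::string& s) {
    if (s.size() > 8) throw std::invalid_argument("too many digits");
    std::string padded(8 - s.size(), '0');  // the only copy made
    padded += s;
    std::array<int, 4> fields = {0, 0, 0, 0};
    for (std::size_t i = 0; i < 4; ++i) {
        const char hi = padded[2 * i];
        const char lo = padded[2 * i + 1];
        if (!std::isdigit(static_cast<unsigned char>(hi)) ||
            !std::isdigit(static_cast<unsigned char>(lo)))
            throw std::invalid_argument("non-digit character");
        fields[i] = (hi - '0') * 10 + (lo - '0');
    }
    return fields;
}
```

Unlike the switch above, this version validates every character, at the price of one 8-character copy.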

太傻旳人生 2024-07-15 12:09:15

Regular expressions come to mind. Something like "^0*?(\\d?\\d?)(\\d?\\d?)(\\d?\\d?)(\\d?\\d?)$" with boost::regex. Submatches will provide you with the digit values. It shouldn't be difficult to adapt it to your other format with colons between the numbers (see sep61.myopenid.com's answer). boost::regex is among the fastest regex parsers out there.
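
One caveat worth checking: with greedy optional groups, the leftmost groups fill first, so a shorthand such as sff would land in the first two groups instead of the last two. A possible variation, sketched here with std::regex rather than boost::regex, uses lazy quantifiers on the first three groups so that the digits end up right-aligned. The pattern and the helper name parse_regex are assumptions for illustration, not part of the original answer.

```cpp
#include <array>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: the first three groups are lazy ({0,2}?), so they
// take as few digits as possible and the rightmost groups are filled first.
std::array<int, 4> parse_regex(const std::string& s) {
    static const std::regex pattern(
        "^(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2}?)(\\d{1,2})$");
    std::smatch m;
    if (!std::regex_match(s, m, pattern))
        throw std::invalid_argument("not a timecode");
    std::array<int, 4> fields = {0, 0, 0, 0};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::string digits = m[i + 1].str();
        fields[i] = digits.empty() ? 0 : std::stoi(digits);  // empty group -> 0
    }
    return fields;
}
```

This variation requires at least one digit, so an empty string is rejected rather than read as all zeroes.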
