Reverse offset tokenizer
I have a string to tokenize. Its form is HHmmssff, where H, m, s and f are digits.
It's supposed to be tokenized into four 2-digit numbers, but I also need it to accept shorthand forms such as sff, which should be interpreted as 00000sff.
I wanted to use boost::tokenizer's offset_separator, but it seems to work only with positive offsets, and I'd like it to work backwards from the end of the string.
Ok, one idea is to pad the string with zeroes from the left, but maybe the community comes up with something uber-smart. ;)
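For reference, the padding idea could look roughly like the sketch below. It uses only the standard library rather than boost::tokenizer, and parse_padded is a hypothetical name chosen for illustration. It assumes the input is digits only and at most 8 characters long.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost): left-pads a digit string to
// 8 characters and cuts it at the fixed offsets 0, 2, 4 and 6.
std::array<int, 4> parse_padded(const std::string& input) {
    if (input.size() > 8) {
        throw std::invalid_argument("at most 8 digits expected");
    }
    for (char c : input) {
        if (c < '0' || c > '9') {
            throw std::invalid_argument("digits only");
        }
    }
    // "sff" becomes "00000sff".
    const std::string padded = std::string(8 - input.size(), '0') + input;
    assert(padded.size() == 8);

    std::array<int, 4> fields{};  // HH, mm, ss, ff
    for (std::size_t i = 0; i < fields.size(); ++i) {
        fields[i] = std::stoi(padded.substr(2 * i, 2));
    }
    return fields;
}
```

With this, parse_padded("123") yields the fields 0, 0, 1 and 23.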
Edit: Additional requirements have just come into play.
The basic need for a smarter solution was to handle all cases, like f, ssff, mssff, etc., but also to accept a more complete time notation, like HH:mm:ss:ff, with its shorthand forms, e.g. s:ff or even s: (this one is supposed to be interpreted as s:00).
In the case where the string ends with : I can obviously pad it with two zeroes as well, then strip out all separators, leaving just the digits, and parse the resulting string with Spirit.
But it seems like it would be a bit simpler if there were a way to make the offset tokenizer go back from the end of the string (offsets -2, -4, -6, -8) and lexically cast the numbers to ints.
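The normalise-then-walk-backwards idea described above might be sketched as follows. This is plain standard C++, not boost::tokenizer or Spirit, and parse_from_end is a hypothetical name. It assumes that every field written after a colon has two digits, since stripping the separators would otherwise shift digits between fields ("1:5" would be read as ff = 15).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: normalises the input as described above, then reads
// two-character groups at offsets -2, -4, -6, -8 from the end of the string.
std::array<int, 4> parse_from_end(std::string text) {
    // A trailing separator means the last field is missing: "s:" -> "s:00".
    if (!text.empty() && text.back() == ':') {
        text += "00";
    }
    // Strip the separators so that only digits remain.
    text.erase(std::remove(text.begin(), text.end(), ':'), text.end());
    if (text.size() > 8 ||
        !std::all_of(text.begin(), text.end(),
                     [](char c) { return c >= '0' && c <= '9'; })) {
        throw std::invalid_argument("expected at most 8 digits");
    }

    std::array<int, 4> fields{};  // HH, mm, ss, ff; missing fields stay 0
    std::size_t end = text.size();
    for (std::size_t i = fields.size(); i-- > 0 && end > 0;) {
        const std::size_t begin = end >= 2 ? end - 2 : 0;
        fields[i] = std::stoi(text.substr(begin, end - begin));
        end = begin;
    }
    assert(end == 0);  // every digit was consumed
    return fields;
}
```

For example, parse_from_end("5:") gives 0, 0, 5, 0, and parse_from_end("12:34:56:78") gives 12, 34, 56, 78.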
Comments (3)
I keep preaching BNF notation. If you can write down the grammar that defines your problem, you can easily convert it into a Boost.Spirit parser, which will do it for you.
Edit: additional constraint
In response to the comment "Don't mean to be a performance freak, but this solution involves some string copying (input is a const & std::string)".
If you really care about performance so much that you can't use a big old library like regex, won't risk a BNF parser, don't want to assume that std::string::substr will avoid a copy with allocation (and hence can't use STL string functions), and can't even copy the string chars into a buffer and left-pad with '0' characters:
But fundamentally, just 0-initialising those int variables is almost as much work as copying the string into a char buffer with padding, so I wouldn't expect to see any significant performance difference. I therefore don't actually recommend this solution in real life, just as an exercise in premature optimisation.
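A copy-free variant along these lines could look like the sketch below. It is not the code from this answer, just one possible shape under the same constraints: no regex, no substr, no padded buffer. The name parse_in_place is hypothetical, and the sketch assumes a digits-only input of at most 8 characters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the digits right to left and accumulates them
// directly into zero-initialised ints, without copying the string.
std::array<int, 4> parse_in_place(const std::string& input) {
    if (input.size() > 8) {
        throw std::invalid_argument("at most 8 digits expected");
    }
    std::array<int, 4> fields{};  // HH, mm, ss, ff, all 0 to begin with
    std::size_t pos_from_end = 0;
    for (auto it = input.rbegin(); it != input.rend(); ++it, ++pos_from_end) {
        if (*it < '0' || *it > '9') {
            throw std::invalid_argument("digits only");
        }
        // Positions 0-1 from the end belong to ff, 2-3 to ss, and so on.
        const std::size_t field = 3 - pos_from_end / 2;
        assert(field < fields.size());
        // Within a field, the first digit seen is the units, the second the tens.
        const int weight = (pos_from_end % 2 == 0) ? 1 : 10;
        fields[field] += (*it - '0') * weight;
    }
    return fields;
}
```

Here parse_in_place("123") produces 0, 0, 1, 23 without allocating anything beyond the result array.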
Regular expressions come to mind. Something like
"^0*?(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2})$"
with boost::regex. Submatches will provide you with the digit values. It shouldn't be difficult to adapt this to your other format with colons between the numbers (see sep61.myopenid.com's answer). boost::regex is among the fastest regex parsers out there.
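A rough sketch of the regex approach, using std::regex from the standard library in place of boost::regex so that it has no external dependency. The name parse_regex is hypothetical. The groups are written with lazy quantifiers because, in a backtracking engine, greedy groups fill up from the left, so "123" would come out as "12" and "3" rather than "1" and "23".

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: splits up to 8 digits into HH, mm, ss, ff using
// four capture groups; the rightmost groups are filled first.
std::array<int, 4> parse_regex(const std::string& input) {
    static const std::regex pattern(
        "^0*?(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2}?)(\\d{0,2})$");
    std::smatch match;
    if (!std::regex_match(input, match, pattern)) {
        throw std::invalid_argument("input does not match the digit pattern");
    }
    assert(match.size() == 5);  // whole match plus four groups

    std::array<int, 4> fields{};
    for (std::size_t i = 0; i < fields.size(); ++i) {
        const std::string digits = match[i + 1].str();
        // An empty group means the field was omitted, so it stays 0.
        fields[i] = digits.empty() ? 0 : std::stoi(digits);
    }
    return fields;
}
```

For example, parse_regex("12345") gives 0, 1, 23, 45. Handling the colon notation would need an optional separator between the groups, which this sketch leaves out.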