Using Boost.Spirit to extract certain tags/attributes from HTML
So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.
One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.
This is my current regex:
(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)
I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.
And of course, the Boost Spirit variant could not be left out:
To be honest, the Spirit code is slightly more versatile than the other variations: the Spirit parser would be easier to adapt to cross-line matching. This could most easily be achieved by using spirit::istream_iterator<> (which is unfortunately notoriously slow) or by using const char* as iterators; the latter approach works equally well for the other techniques.
The code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)
I heartily suggest using a regex here:
Use it like:
I cannot see any reason why the Spirit-based approach should or could be any faster.
Edit PS. If you claim that static optimization would be the key, why not just convert it into a static Boost Xpressive regular expression?
Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:
Interestingly, there is no discernible speed difference compared with using a dynamic regular expression; however, on the whole the Xpressive version performs better than the Boost Regex version (by roughly 10%).
The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)