Extracting certain tags/attributes from HTML using Boost.Spirit

Posted 2024-12-20 22:30:24

So I've been learning a bit about Boost.Spirit to replace the use of regular expressions in a lot of my code. The main reason is pure speed. I've found Boost.Spirit to be up to 50 times faster than PCRE for some relatively simple tasks.

One thing that is a big bottleneck in one of my apps is taking some HTML, finding all "img" tags, and extracting the "src" attribute.

This is my current regex:

(?i:<img\s[^\>]*src\s*=\s*[""']([^<][^""']+)[^\>]*\s*/*>)

I've been playing around with it trying to get something to work in Spirit, but so far I've come up empty. Any tips on how to create a set of Spirit rules that will accomplish the same thing as this regex would be awesome.

帅哥哥的热头脑 2024-12-27 22:30:24

And of course, the Boost Spirit variant couldn't be missed:

sehe@natty:/tmp$ time ./spirit < bench > /dev/null

real    0m3.895s
user    0m3.820s
sys     0m0.070s

To be honest, the Spirit code is slightly more versatile than the other variations:

- It parses attributes a bit more smartly, so it would be easy to handle a variety of attributes at the same time, perhaps depending on the containing element.
- The Spirit parser would be easier to adapt to cross-line matching. This could most easily be achieved by
  - using spirit::istream_iterator<> (which is unfortunately notoriously slow), or
  - using a memory-mapped file with raw const char* as iterators; the latter approach works equally well for the other techniques.
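
As a rough illustration of the cross-line point, here is a minimal sketch that reads the whole input into one buffer and scans it with a single pair of iterators. It is not the code from the gist: it assumes a C++11 compiler, swaps in std::regex in place of Spirit so that it needs nothing beyond the standard library, and the helper names slurp and img_sources are made up for this example. A memory-mapped file would hand the matcher a const char* range in much the same way.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: read an entire stream into one string, so that
// matching is no longer limited to one line at a time.
std::string slurp(std::istream& in)
{
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: collect the src value of every <img> tag in the
// buffer. \s and [^>] both match newlines, so a tag may span several lines.
std::vector<std::string> img_sources(const std::string& html)
{
    static const std::regex re(R"re(<img\s+[^>]*?src\s*=\s*(["'])(.*?)\1)re",
                               std::regex::icase);
    std::vector<std::string> out;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end; it != end; ++it)
        out.push_back((*it)[2].str());   // group 2 is the text between the quotes
    return out;
}
```

Calling img_sources(slurp(std::cin)) would then process standard input in one pass instead of line by line.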

The code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)

//#define BOOST_SPIRIT_DEBUG
#include <string>
#include <iostream>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>

namespace qi  = boost::spirit::qi;
namespace phx = boost::phoenix;

void handle_attr(
        const std::string& elem, 
        const std::string& attr, 
        const std::string& value)
{
    if (elem == "img" && attr == "src")
        std::cout << "value : " << value << std::endl;
}

typedef std::string::const_iterator It;
typedef qi::space_type Skipper;

struct grammar : qi::grammar<It, Skipper>
{
    grammar() : grammar::base_type(html)
    {
        using namespace boost::spirit::qi;
        using phx::bind;

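        // attr: name = "value" or name = 'value'. The name is kept in the local _a;
        // the inherited attribute _r1 carries the name of the enclosing element.
        attr = as_string [ +~char_("= \t\r\n/>") ] [ _a = _1 ]
                >> '=' >> (
                    as_string [ '"' >> lexeme [ *~char_('"') ] >> '"' ]
                  | as_string [ "'" >> lexeme [ *~char_("'") ] >> "'" ]
                  ) [ bind(handle_attr, _r1, _a, _1) ]
            ;

        // elem: '<', a tag name (not starting with '-', '/' or '>'), then any attributes.
        elem = lit('<')
            >> as_string [ lexeme [ ~char_("-/>") >> *(char_ - space - char_("/>")) ] ] [ _a = _1 ]
            >> *attr(_a);

        // html: optional elements separated by runs of "</" or any character other than '<'.
        html = (-elem) % +("</" | (char_ - '<'));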

        BOOST_SPIRIT_DEBUG_NODE(html);
        BOOST_SPIRIT_DEBUG_NODE(elem);
        BOOST_SPIRIT_DEBUG_NODE(attr);
    }

    qi::rule<It, Skipper> html;
    qi::rule<It, Skipper, qi::locals<std::string> > elem;
    qi::rule<It, qi::unused_type(std::string), Skipper, qi::locals<std::string> > attr;
};

int main(int argc, const char *argv[])
{
    std::string s;

    const static grammar html_;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(),
           l = s.end();

        if (!phrase_parse(f, l, html_, qi::space) || (f!=l))
            std::cerr << "unparsed: " << std::string(f,l) << std::endl;
    }

    return 0;
}

|煩躁 2024-12-27 22:30:24

Update

I did benchmarks. Full disclosure is here: https://gist.github.com/c16725584493b021ba5b
It includes the full code used, the compilation flags and the body of test data (file bench) used.

In short:

- Regular expressions are indeed faster and way simpler here.
- Do not underestimate the time I spent debugging the Spirit grammar to get it correct!
- Care has been taken to eliminate 'accidental' differences, e.g. by
  - keeping handle_attr unchanged across the implementations, even though it mostly makes sense only for the Spirit implementation;
  - using the same line-wise input style and string iterators for all of them.
- Right now, all three implementations produce exactly the same output.
- Everything was built and timed with g++ 4.6.1 (C++03 mode), -O3.

Edit, in reply to the knee-jerk (and correct) response that you shouldn't be parsing HTML using regexes:

- You shouldn't be using regexes to parse non-trivial inputs (mainly, anything with a grammar). Of course, Perl 5.10+ 'regex grammars' are an exception, because they are not isolated regexes anymore.
- HTML basically cannot be parsed; it is non-standard tag soup. Strict (X)HTML is a different matter.
- According to Xaade, if you haven't got enough time to produce a perfect implementation using a standards-compliant HTML reader, you should

  "ask client if they want shit or not. If they want shit, you charge them more. Shit costs you more than them." -- Xaade

That said, there are scenarios in which I'd do precisely what I suggest here: use a regex. Mainly, if it is a one-off quick search, or to get daily, rough statistics of known data, etc. YMMV and you should make your own call.

For timings and summaries, see:

- Boost Regex answer below
- Boost Xpressive answer here
- Spirit answer here

I heartily suggest using a regex here:

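#include <iostream>
#include <string>
#include <boost/regex.hpp>

// Same callback as in the Spirit version: prints the value of each img/src attribute.
void handle_attr(
        const std::string& elem,
        const std::string& attr,
        const std::string& value)
{
    if (elem == "img" && attr == "src")
        std::cout << "value : " << value << std::endl;
}

typedef std::string::const_iterator It;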

int main(int argc, const char *argv[])
{
    const boost::regex re("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");

    std::string s;
    boost::smatch what;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(), l = s.end();

        do
        {
            if (!boost::regex_search(f, l, what, re))
                break;

            handle_attr("img", "src", what[2]);
            f = what[0].second;
        } while (f!=s.end());
    }
    
    return 0;
}

Use it like:

./test < index.htm
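
If Boost is not available, roughly the same search loop can be written against the C++11 <regex> header. The sketch below is only an approximation of the code above, not something that was benchmarked here: each_img_src is a made-up helper name, and a std::function callback stands in for handle_attr so the snippet is self-contained.

```cpp
#include <functional>
#include <regex>
#include <string>

typedef std::string::const_iterator It;

// Hypothetical helper (not part of the answer): calls on_src once for every
// img/src value found in line.
void each_img_src(const std::string& line,
                  const std::function<void(const std::string&)>& on_src)
{
    static const std::regex re("<img\\s+[^>]*?src\\s*=\\s*([\"'])(.*?)\\1");
    std::smatch what;
    It f = line.begin(), l = line.end();

    // Search repeatedly, resuming right after the end of the previous match.
    while (f != l && std::regex_search(f, l, what, re))
    {
        on_src(what[2].str());   // group 2 is the attribute value
        f = what[0].second;      // continue after the whole match
    }
}
```

The control flow is the same as in the Boost.Regex version: only the namespace and the header change, and the match object is still advanced by hand through what[0].second.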

I cannot see any reason why the Spirit-based approach should or could be any faster.

Edit PS. If you claim that static optimization would be the key, why not just convert it into a static Boost Xpressive regular expression?

裸钻 2024-12-27 22:30:24

Out of curiosity I redid my regex sample based on Boost Xpressive, using statically compiled regexes:

sehe@natty:/tmp$ time ./expressive < bench > /dev/null

real    0m2.146s
user    0m2.110s
sys     0m0.030s

Interestingly, there is no discernible speed difference when using the dynamic regular expression instead; however, on the whole the Xpressive version performs better than the Boost Regex version (by roughly 10%).

What is really nice, IMO, is that switching from Boost Regex to Xpressive was almost just a matter of including xpressive.hpp and changing a few namespaces around. The API (as far as it was being used) is exactly the same.

The relevant code is as follows: (full code at https://gist.github.com/c16725584493b021ba5b)

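#include <iostream>
#include <string>
#include <boost/xpressive/xpressive.hpp>

// Same callback as in the other answers: prints the value of each img/src attribute.
void handle_attr(
        const std::string& elem,
        const std::string& attr,
        const std::string& value)
{
    if (elem == "img" && attr == "src")
        std::cout << "value : " << value << std::endl;
}

typedef std::string::const_iterator It;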

int main(int argc, const char *argv[])
{
    using namespace boost::xpressive;
#if DYNAMIC
    const sregex re = sregex::compile
         ("<img\\s+[^\\>]*?src\\s*=\\s*([\"'])(.*?)\\1");
#else
    const sregex re = "<img" >> +_s >> -*(~(set = '\\','>')) >> 
        "src" >> *_s >> '=' >> *_s
        >> (s1 = as_xpr('"') | '\'') >> (s2 = -*_) >> s1;
#endif

    std::string s;
    smatch what;

    while (std::getline(std::cin, s))
    {
        It f = s.begin(), l = s.end();

        do
        {
            if (!regex_search(f, l, what, re))
                break;

            handle_attr("img", "src", what[2]);
            f = what[0].second;
        } while (f!=s.end());
    }

    return 0;
}
