当前位置：文江博客话题详情

c++ screen-scraping

网页抓取选项 - C++ 仅版本

发布于 2024-07-18 07:34:57 字数 211 浏览 5 评论 0原文

我正在寻找一个好的 C++ 库来进行网页抓取。
它必须是C/C++，没有其他，所以请不要引导我HTML 抓取选项或其他 SO 问题/答案，其中甚至没有提到 C++。

收藏 0

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

评论（4）

惯饮孤独 2024-07-25 07:34:57

libcurl 下载 html 文件
libtidy 转换为有效的 xml
libxml 解析/导航 xml

回复收藏 0 原文

柳絮泡泡 2024-07-25 07:34:57

此处使用 myhtml C/C++ 解析器；非常简单，非常快。除 C99 之外没有任何依赖项。并内置 CSS 选择器（示例此处）

回复收藏 0 原文

疯了 2024-07-25 07:34:57

我推荐 Qt5.6.2，这个强大的库为我们提供了

高级、直观、异步的网络 api，如 QNetworkAccessManager、QNetworkReply、QNetworkProxy 等
强大的正则表达式类，如 QRegularExpression
体面的 Web 引擎，如 QtWebEngine
强大、成熟的 gui，如 QWidgets
大多数 Qt5 api 都设计良好、信号和槽也使编写异步代码变得更加容易
出色的 unicode 支持
功能丰富的文件系统库。无论是创建、删除、重命名还是查找保存文件的标准路径，在 Qt5 中都是小菜一碟
QNetworkAccessManager 的异步 api 可以轻松地一次生成多个下载请求
跨主要桌面平台、windows、mac os 和 linux，在任何地方编译一次即可编写，仅一个代码库。
易于在 windows 和 mac 上部署（linux？也许 linuxdeployqt 可以为我们省去很多麻烦）
易于在 windows、mac 和 linux 上安装
等等

我已经用 Qt5 编写了一个图像抓取应用程序，这个应用程序可以抓取几乎所有搜索到的图像谷歌、必应和雅虎。

要了解更多详细信息，请访问我的github项目。
我写下了有关如何通过 Qt5 抓取数据的高级概述
我的博客（在堆栈溢出上发布太长了）。

回复收藏 0 原文

沦落红尘 2024-07-25 07:34:57

// download winhttpclient.h
// --------------------------------
#include <winhttp\WinHttpClient.h>
using namespace std;
typedef unsigned char byte;
#define foreach         BOOST_FOREACH
#define reverse_foreach BOOST_REVERSE_FOREACH

bool substrexvealue(const std::wstring& html,const std::string& tg1,const std::string& tg2,std::string& value, long& next) {
    long p1,p2;
    std::wstring wtmp;
    std::wstring wtg1(tg1.begin(),tg1.end());
    std::wstring wtg2(tg2.begin(),tg2.end());

    p1=html.find(wtg1,next);
    if(p1!=std::wstring::npos) {
        p2=html.find(wtg2,next);
        if(p2!=std::wstring::npos) {
            p1+=wtg1.size();
            wtmp=html.substr(p1,p2-p1-1);
            value=std::string(wtmp.begin(),wtmp.end());
            boost::trim(value);
            next=p1+1;
        }
    }
    return p1!=std::wstring::npos;
}
bool extractvalue(const std::wstring& html,const std::string& tag,std::string& value, long& next) {
    long p1,p2,p3;
    std::wstring wtmp;
    std::wstring wtag(tag.begin(),tag.end());

    p1=html.find(wtag,next);
    if(p1!=std::wstring::npos) {
        p2=html.find(L">",p1+wtag.size()-1);
        p3=html.find(L"<",p2+1);
        wtmp=html.substr(p2+1,p3-p2-1);
        value=std::string(wtmp.begin(),wtmp.end());
        boost::trim(value);
        next=p1+1;
    }
    return p1!=std::wstring::npos;
}
bool GetHTML(const std::string& url,std::wstring& header,std::wstring& hmtl) {
    std::wstring wurl = std::wstring(url.begin(),url.end());
    bool ret=false;
    try {
        WinHttpClient client(wurl.c_str());
        std::string url_protocol=url.substr(0,5);
        std::transform(url_protocol.begin(), url_protocol.end(), url_protocol.begin(), (int (*)(int))std::toupper);
        if(url_protocol=="HTTPS")    client.SetRequireValidSslCertificates(false);
        client.SetUserAgent(L"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        if(client.SendHttpRequest()) {
            header = client.GetResponseHeader();
            hmtl = client.GetResponseContent();
            ret=true;
        }
    }catch(...) {
        header=L"Error";
        hmtl=L"";
    }
    return ret;
}
int main() {
    std::string url = "http://www.google.fr";
    std::wstring header,html;
    GetHTML(url,header,html));
}

// download winhttpclient.h
// --------------------------------
#include <winhttp\WinHttpClient.h>
using namespace std;
typedef unsigned char byte;
#define foreach         BOOST_FOREACH
#define reverse_foreach BOOST_REVERSE_FOREACH

bool substrexvealue(const std::wstring& html,const std::string& tg1,const std::string& tg2,std::string& value, long& next) {
    long p1,p2;
    std::wstring wtmp;
    std::wstring wtg1(tg1.begin(),tg1.end());
    std::wstring wtg2(tg2.begin(),tg2.end());

    p1=html.find(wtg1,next);
    if(p1!=std::wstring::npos) {
        p2=html.find(wtg2,next);
        if(p2!=std::wstring::npos) {
            p1+=wtg1.size();
            wtmp=html.substr(p1,p2-p1-1);
            value=std::string(wtmp.begin(),wtmp.end());
            boost::trim(value);
            next=p1+1;
        }
    }
    return p1!=std::wstring::npos;
}
bool extractvalue(const std::wstring& html,const std::string& tag,std::string& value, long& next) {
    long p1,p2,p3;
    std::wstring wtmp;
    std::wstring wtag(tag.begin(),tag.end());

    p1=html.find(wtag,next);
    if(p1!=std::wstring::npos) {
        p2=html.find(L">",p1+wtag.size()-1);
        p3=html.find(L"<",p2+1);
        wtmp=html.substr(p2+1,p3-p2-1);
        value=std::string(wtmp.begin(),wtmp.end());
        boost::trim(value);
        next=p1+1;
    }
    return p1!=std::wstring::npos;
}
bool GetHTML(const std::string& url,std::wstring& header,std::wstring& hmtl) {
    std::wstring wurl = std::wstring(url.begin(),url.end());
    bool ret=false;
    try {
        WinHttpClient client(wurl.c_str());
        std::string url_protocol=url.substr(0,5);
        std::transform(url_protocol.begin(), url_protocol.end(), url_protocol.begin(), (int (*)(int))std::toupper);
        if(url_protocol=="HTTPS")    client.SetRequireValidSslCertificates(false);
        client.SetUserAgent(L"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        if(client.SendHttpRequest()) {
            header = client.GetResponseHeader();
            hmtl = client.GetResponseContent();
            ret=true;
        }
    }catch(...) {
        header=L"Error";
        hmtl=L"";
    }
    return ret;
}
int main() {
    std::string url = "http://www.google.fr";
    std::wstring header,html;
    GetHTML(url,header,html));
}

回复收藏 0 原文

~没有更多了~

关于作者

暂无简介

0 文章

0 评论

22 人气

关注发私信

相关话题

热门标签

操作系统程序设计 IT运维 Linux系统管理 JavaScript 服务器应用 solaris C/C++ PHP Shell BSD Vue.js aix Oracle Python HTML 系统管理 HTML5 CSS 前端

推荐作者

linfzu01

文章 0 评论 0

§对你不离不弃

文章 0 评论 0

可遇━不可求

文章 0 评论 0

枕梦

文章 0 评论 0

qq_3LFa8Q

文章 0 评论 0

JP

文章 0 评论 0

友情链接

我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的隐私政策了解更多相关信息。单击 接受 或继续使用网站，即表示您同意使用 Cookies 和您的相关数据。

原文