How do I crawl Zhihu answers with Java WebMagic?

Posted 2021-12-08 13:27:49, 2706 characters, 863 views, 4 comments

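When I use WebMagic to crawl all the answers under a particular Zhihu question, I only ever get the first two answers.

I have read all sorts of blog posts and tried various approaches, but the request always either returns only 2 answers or fails outright with a 401.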

o.a.h.impl.execchain.MainClientExec - Connection can be kept alive indefinitely
o.a.http.impl.auth.HttpAuthenticator - Authentication required
o.a.http.impl.auth.HttpAuthenticator - www.zhihu.com:443 requested authentication
o.a.http.impl.auth.HttpAuthenticator - Response contains no authentication challenges
o.a.h.c.p.ResponseProcessCookies - Cookie accepted [aliyungf_tc="AQAAAD1PxXQABgUA7CesO3+7/0/iFhJt", version:0, domain:www.zhihu.com, path:/, expiry:null]
o.a.h.i.c.PoolingHttpClientConnectionManager - Connection [id: 0][route: {s}->https://www.zhihu.com:443] can be kept alive indefinitely
o.a.h.i.c.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://www.zhihu.com:443][total kept alive: 1; route allocated: 1 of 100; total allocated: 1 of 1]
u.c.webmagic.utils.CharsetUtils - Auto get charset: null
u.c.w.d.HttpClientDownloader - Charset autodetect failed, use UTF-8 as charset. Please specify charset in Site.setCharset()
u.c.w.d.HttpClientDownloader - downloading page success https://www.zhihu.com/api/v4/questions/29688243/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=3&offset=3
09:04:14.908 [pool-1-thread-1] INFO  us.codecraft.webmagic.Spider - page status code error, page https://www.zhihu.com/api/v4/questions/29688243/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=3&offset=3 , code: 401
 

 

Any pointers from the experts here would be much appreciated.
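
For reference, the request in the log pages through the answers endpoint with `limit` and `offset` query parameters, so fetching every answer means requesting one URL per page. The following is only a minimal sketch of building those URLs with the JDK alone. The class and method names are hypothetical, the long `include=...` parameter from the log is left out for brevity (a real request would presumably keep it), and the number of pages is passed in rather than discovered. The 401 in the log is a separate matter: it means the server treated the request as unauthenticated, which this sketch does not address.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds paged URLs in the same shape as the request in the log.
class AnswerPageUrls {
    // Base path taken from the log; the include=... parameter is omitted here.
    static final String BASE = "https://www.zhihu.com/api/v4/questions/";

    // URL for one page: 'limit' answers starting at position 'offset'.
    static String pageUrl(long questionId, int limit, int offset) {
        return BASE + questionId + "/answers?sort_by=default"
                + "&limit=" + limit + "&offset=" + offset;
    }

    // URLs for the first 'pages' pages; the offset advances by 'limit' each time.
    static List<String> pageUrls(long questionId, int limit, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 0; page < pages; page++) {
            urls.add(pageUrl(questionId, limit, page * limit));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Question id and limit are the ones that appear in the log.
        for (String url : pageUrls(29688243L, 3, 3)) {
            System.out.println(url);
        }
    }
}
```

In a WebMagic `PageProcessor`, URLs like these would typically be queued as follow-up requests instead of being printed.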


Comments (4)

把回忆走一遍 2021-12-10 02:03:35

I take it you implemented what you describe in code. Could you share that part so I can learn from it? Thanks!

蓝颜夕 2021-12-10 02:01:35

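I crawled this a while ago. If you view the page source, you will see that the data has already been pre-fetched for the front end and sits inside a JS script tag. At the time I got the data by parsing the JSON in there.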
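
The approach described in this comment could look roughly like the sketch below: pull the body of a `<script>` tag out of the page HTML and hand it to a JSON parser. This is not the commenter's actual code. The class and method names are hypothetical, the script id `js-initialData` and the sample HTML are assumptions made for illustration (check the real page source for the actual id), and only the JDK is used so the extraction step can be run on its own.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts JSON that a page embeds inside a <script> tag.
class EmbeddedJsonExtractor {

    // Returns the trimmed body of the first <script id="scriptId" ...> element, if any.
    static Optional<String> extractScriptBody(String html, String scriptId) {
        // DOTALL lets the body span several lines; the lazy (.*?) stops at the first </script>.
        Pattern pattern = Pattern.compile(
                "<script\\b[^>]*\\bid=\"" + Pattern.quote(scriptId) + "\"[^>]*>(.*?)</script>",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up page fragment; the id and the JSON shape are assumptions.
        String html = "<html><body><script id=\"js-initialData\" type=\"text/json\">"
                + "{\"answers\":[{\"id\":1},{\"id\":2}]}</script></body></html>";
        System.out.println(extractScriptBody(html, "js-initialData").orElse("not found"));
    }
}
```

The extracted string can then be passed to whichever JSON library the project already uses. A regex is enough for a single well-known tag, but an HTML parser is the more robust choice if the markup varies.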

凡尘雨 2021-12-10 00:10:42

I have not worked with Java crawlers. From the scraping I have done, Python still feels more convenient for this.

尐偏执 2021-12-09 20:03:08

Zhihu has a lot of anti-scraping measures in place.
