Retrieving comments from a website that uses Disqus
I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that CNN uses Disqus for its comment discussions. Because the comments are not split across web pages (i.e., prev page, next page) but are loaded dynamically (i.e., you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.
Any ideas or suggestions?
Thanks so much!
Comments (2)
I needed to scrape comments from a page that loads its Disqus comments via Ajax. Because the comments were not rendered on the server, I had to call the Disqus API directly. You will need the identifier code, which appears in the page source.
Also, look in the JS source code to get the page's public key and forum name, and place them in the API URL where appropriate.
I tested this with JavaScript on Node.js.
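The answer's original code is not shown here, so the following is only a rough sketch of what such a call might look like, not the author's script. It assumes the Disqus `threads/listPosts` endpoint with `api_key`, `forum`, `thread=ident:<identifier>` and `cursor` parameters, and a response carrying `response` and `cursor.hasNext` / `cursor.next`; check these against the current Disqus API documentation. The function names and the injected `fetchJson` helper are made up for illustration.

```javascript
// Sketch only: endpoint and parameter names are assumptions about the Disqus API;
// the key, forum name and identifier must come from the page's own source.
const API_BASE = "https://disqus.com/api/3.0/threads/listPosts.json";

// Build the URL for one page of comments.
function buildListPostsUrl({ apiKey, forum, identifier, cursor, limit = 100 }) {
  const params = new URLSearchParams({
    api_key: apiKey,
    forum: forum,
    thread: `ident:${identifier}`,
    limit: String(limit),
  });
  if (cursor) params.set("cursor", cursor);
  return `${API_BASE}?${params.toString()}`;
}

// Follow the cursor until the API reports that there is no further page.
// fetchJson (hypothetical) takes a URL and resolves to the parsed JSON body;
// it is injected so the loop can be exercised without network access.
async function fetchAllComments(options, fetchJson, maxPages = 1000) {
  const comments = [];
  let cursor = null;
  for (let page = 0; page < maxPages; page++) {
    const body = await fetchJson(buildListPostsUrl({ ...options, cursor }));
    comments.push(...body.response);
    if (!body.cursor || !body.cursor.hasNext) break;
    cursor = body.cursor.next;
  }
  return comments;
}
```

On Node.js 18 or later, `fetchJson` could be as simple as `(url) => fetch(url).then((r) => r.json())`. The `maxPages` cap is there only to keep a misbehaving cursor from looping forever.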
Another option for scraping (other than just fetching the page), which might be less robust (depending on your needs) but will solve the problem you have, is to use some kind of wrapper around a full-fledged web browser, literally code the usage pattern, and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir - Ruby, 2) WatiN - IE & Firefox via .NET, 3) Selenium - IE via C#/Java/Perl/PHP/Ruby/Python
I'll provide a little example using WatiN & C#:
Notice:
I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping the Link & click parts of the code I posted until all the comments are visible, and then scrape the List element dsq-comments.
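The click-until-done loop described in the notice could be sketched roughly as below. This is not the WatiN code from the answer: `browser` is a hypothetical adapter around whichever browser wrapper is used (WatiN, Watir, Selenium), and its three methods `findLoadMoreLink`, `click` and `readComments` are invented names for illustration.

```javascript
// Sketch only: `browser` is a hypothetical adapter, not a real library object.
// It is assumed to expose:
//   findLoadMoreLink() -> the "load next 25" link, or null when it is gone
//   click(link)        -> clicks the link and waits for the new comments
//   readComments()     -> returns the items of the dsq-comments list element
async function loadAllComments(browser, maxClicks = 500) {
  let clicks = 0;
  // Keep clicking while the "load next 25" link is still present.
  while (clicks < maxClicks) {
    const link = await browser.findLoadMoreLink();
    if (!link) break;
    await browser.click(link);
    clicks++;
  }
  // With everything expanded, read the comment list once.
  const comments = await browser.readComments();
  return { clicks, comments };
}
```

The `maxClicks` cap is a safety net in case the link never disappears; at 25 comments per click, 5000+ comments need a little over 200 clicks.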