Retrieving comments from a website that uses Disqus

Posted on 2024-12-27 21:26:51

I would like to write a scraping script to retrieve the comments on CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1

I realize that CNN uses Disqus for its comment discussions. Because the comments are not paginated across web pages (i.e., there is no prev page / next page) but are loaded dynamically (i.e., you have to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.

Any ideas or suggestions?

Thanks so much!


Comments (2)

生死何惧 2025-01-03 21:26:51

I needed to scrape comments from a page that loaded its Disqus comments via ajax. Because the comments were not rendered on the server, I had to call the Disqus API directly. From the page source, you will need the identifier code:

var identifier = "456643" // take note of this from the page source
// this is the ident url query param in the following js request

Also, look in the JS source code to find the page's public key and the forum name, and place them in the URL where appropriate.

I tested this with Node.js, i.e.:

var request = require("request");

// Public API key taken from the page's JS source
var publicKey  = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";

var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";

// request calls back with (error, response, body), in that order
request(disqusUri, function (err, res, body) {
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
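
A single call like this returns only one page of posts, so collecting all 5000+ comments means calling the endpoint repeatedly. Below is a minimal sketch of that loop, not a definitive implementation. It assumes each response has the shape `{ cursor: { hasNext, next }, response: [...] }` and that `limit` accepts values up to 100; check both against the Disqus API documentation. The names `buildListPostsUrl`, `fetchAllPosts` and the `fetchJson` parameter are made up for illustration, and the default fetcher assumes Node 18+ with a global `fetch`.

```javascript
// Assumed endpoint, same as in the answer above.
const API = "https://disqus.com/api/3.0/threads/listPosts.json";

// Build the URL for one page of posts; `cursor` is omitted for the first page.
function buildListPostsUrl(opts, cursor) {
    const params = new URLSearchParams({
        api_key: opts.publicKey,          // public key from the page source
        forum: opts.forum,                // forum name from the page source
        "thread:ident": opts.identifier,  // identifier from the page source
        limit: String(opts.limit || 100)  // assumption: 100 is the largest page size
    });
    if (cursor) {
        params.set("cursor", cursor);
    }
    return API + "?" + params.toString();
}

// Request page after page until the response reports no further page
// (assumed field: cursor.hasNext) or maxPages is reached.
// `fetchJson` is injectable so the loop can be tried without network access.
async function fetchAllPosts(opts, fetchJson) {
    const getJson = fetchJson || (async (url) => (await fetch(url)).json());
    const maxPages = opts.maxPages || 1000;
    const posts = [];
    let cursor = null;
    for (let page = 0; page < maxPages; page++) {
        const data = await getJson(buildListPostsUrl(opts, cursor));
        posts.push(...data.response);
        if (!data.cursor || !data.cursor.hasNext) {
            break;
        }
        cursor = data.cursor.next;
    }
    return posts;
}
```

Called as `fetchAllPosts({ publicKey: publicKey, forum: "nameOfForumFromSource", identifier: "456643" })`, this resolves to one array holding the posts from every page. If 100 per page is accepted, 5000 comments take roughly 50 requests, so it is worth checking the API's rate limits before running it against a large thread.
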
流心雨 2025-01-03 21:26:51

Another option for scraping (other than just fetching the page), which might be less robust (depending on your needs) but does solve the problem you have, is to use some kind of wrapper around a full-fledged web browser, literally code the usage pattern, and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir (Ruby), 2) WatiN (IE & Firefox via .NET), 3) Selenium (several browsers, including IE, via C#/Java/Perl/PHP/Ruby/Python).

I'll provide a little example using WatiN & C#:

using WatiN.Core;

IE browser = new IE();
browser.GoTo("YOUR CNN URL"); // placeholder: the article URL
List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
// wait until the ajax call has ended by searching for some indicator
browser.WaitUntilContainsText("SOME TEXT"); // placeholder: text that appears once loading is done
// do your scraping thing

Note:
I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping over the Link & Click parts of the code I posted until all the comments are visible, and then scrape the List element dsq-comments once.
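
The loop described in this note can be sketched as follows, here in JavaScript rather than C# so it can sit next to the Node.js answer above. It is a sketch under assumptions: `driver` is expected to behave like a selenium-webdriver `WebDriver` (only `findElements`, `sleep`, and the element methods `isDisplayed` and `click` are used), `loadAllComments` is a made-up name, and the fixed pause is a crude stand-in for properly detecting that the ajax request has finished.

```javascript
// Click the "load more" link until it is gone or hidden, or until
// maxClicks is reached. Resolves to the number of clicks performed.
async function loadAllComments(driver, moreLinkLocator, opts) {
    const maxClicks = (opts && opts.maxClicks) || 500;  // safety cap on the loop
    const pauseMs = (opts && opts.pauseMs) || 1000;     // wait after each click
    let clicks = 0;
    while (clicks < maxClicks) {
        // Look the link up again on every pass: the page may re-render it.
        const links = await driver.findElements(moreLinkLocator);
        if (links.length === 0 || !(await links[0].isDisplayed())) {
            break; // nothing left to load
        }
        await links[0].click();
        clicks++;
        await driver.sleep(pauseMs); // crude wait for the ajax request to finish
    }
    return clicks;
}
```

With selenium-webdriver, the locator would be something like `By.css(".dsq-paginate-append-text")`, using the class name from the example above. Once the function resolves, the `dsq-comments` element can be scraped in one pass.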
