Getting the number of comments on an online news article with Python BeautifulSoup

Posted on 2024-12-01 06:26:49

I'm using the following code:

import urllib
from BeautifulSoup import BeautifulSoup
import re

comment_url = 'http://community.nytimes.com/comments/www.nytimes.com/2011/08/24/world/africa/24libya.html'

response_new = urllib.urlopen(comment_url)
html_new = response_new.read()
soup_new = BeautifulSoup(html_new)
tags = soup_new.findAll('h3', {'class': 'share'})
for tag in tags:
    a = tag.renderContents()
    print a

print "done!"

I am trying to get the number of comments readers have made on a particular New York Times article, using the BeautifulSoup parser to look for the information inside certain tags. On a standard NYTimes article community page, the information is located like this:

<p>Share your thoughts.</p> 
</div> 
<div id="commentsWell"> 
<div id="readerComments"> 
<div class="header clearfix"> 
<h3 class="share">185
 Readers' Comments</h3> 

However, when I run the code, I simply get the word "done!". It is apparent that my code isn't picking up any tags that I have specified. My question is - am I using BeautifulSoup incorrectly? If so, how would you suggest that I amend my code so as to get the desired information?

Thanks
Sneha


Comments (1)

枕花眠 2024-12-08 06:26:49

Use the attrs keyword parameter explicitly:

tags = soup_new.findAll('h3', attrs={'class': 'share'})

In BeautifulSoup 3, findAll is defined as:

def findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

so in a call on the soup object, the first positional argument is bound to name and the second to attrs. Passing attrs= by keyword makes it explicit which parameter receives {'class': 'share'}.
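
Once the h3 tag is matched, tag.renderContents() returns the heading text (in the question's sample, "185" followed by a line break and " Readers' Comments"), so the count still has to be pulled out of that string. Here is a minimal sketch of that step using only the standard re module. The helper name parse_comment_count is hypothetical, and the handling of thousands separators is an assumption about how the page might format larger counts:

```python
import re


def parse_comment_count(heading_text):
    """Return the leading integer of a comments heading, or None if absent.

    Hypothetical helper for illustration; it is not part of BeautifulSoup.
    """
    # Skip leading whitespace, then capture digits. Commas are allowed so
    # that a count like "1,234" is handled (an assumption about the page).
    match = re.match(r"\s*(\d[\d,]*)", heading_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Text as it appears inside <h3 class="share"> in the question's HTML sample.
sample = "185\n Readers' Comments"
print(parse_comment_count(sample))
```

In the question's loop this would be called as parse_comment_count(tag.renderContents()). The function returns None rather than raising when no number is found, so a page without a comment count can be told apart from one that has zero comments.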
