I can't scrape anything with BeautifulSoup
I'm using BeautifulSoup to scrape some web content.
I'm learning with this example code, but I always get a "None" response.
Code:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read())
post = soup.find('div', attrs={'id': 'topmenucontainer'})
print post
Any idea what I'm doing wrong?
Thanks!!
Comments (3)
I don't think you are doing anything wrong.
It is the second script tag that is confusing BeautifulSoup. The tag looks like this:
but BeautifulSoup seems to think it is still inside a comment or something similar, and it includes the rest of the file as the content of the script tag.
Try:
and you'll see what I mean.
If you remove the CDATA then you should find the page parses correctly:
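One way to try the "remove the CDATA" idea without editing the page by hand is to strip the markers from the downloaded string before parsing. The sketch below is only an illustration: the helper name `strip_cdata_markers` and the sample markup are made up here, and it assumes the script tag wraps its body in `//<![CDATA[ ... //]]>` style markers, which the original answer does not show. It uses only the standard library.

```python
import re

# Hypothetical helper, not part of BeautifulSoup or of the original answer.
# Matches a CDATA opening or closing marker, optionally preceded by the
# "//" JavaScript comment that is often used to hide it from browsers.
_CDATA_MARKER = re.compile(r"(?://\s*)?(?:<!\[CDATA\[|\]\]>)")


def strip_cdata_markers(html):
    """Return html with CDATA markers removed; the wrapped text is kept."""
    return _CDATA_MARKER.sub("", html)


# Made-up sample markup (an assumption, not taken from the real page).
sample = (
    '<script type="text/javascript">//<![CDATA[\n'
    'var x = 1;\n'
    '//]]></script>'
    '<div id="topmenucontainer">menu</div>'
)

cleaned = strip_cdata_markers(sample)
print(cleaned)
```

The cleaned string can then be passed to the BeautifulSoup constructor in place of the raw `read()` result. Note that a plain regex like this would also remove a `]]>` that happens to appear in ordinary script code, so it is a quick experiment rather than a robust fix.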
There is something weird with your HTML. BeautifulSoup tries its best, but sometimes it just can't parse it.
Try moving the first <link> element inside the <head>; that might help.
You could try using the lxml library instead.
lxml article
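As a rough sketch of what the lxml route could look like, the snippet below parses a small made-up page and looks up the same `div` by id. The sample markup is an assumption (the real page is not reproduced here), and lxml is a third-party package that has to be installed separately.

```python
import lxml.html

# Made-up sample page (an assumption): a script tag with CDATA markers,
# followed by the div the question is looking for.
sample = (
    '<html><head>'
    '<script type="text/javascript">//<![CDATA[\n'
    'var x = 1;\n'
    '//]]></script>'
    '</head><body>'
    '<div id="topmenucontainer">menu</div>'
    '</body></html>'
)

# Parse the string into an element tree.
doc = lxml.html.fromstring(sample)

# XPath lookup equivalent to soup.find('div', attrs={'id': 'topmenucontainer'}).
matches = doc.xpath('//div[@id="topmenucontainer"]')

for div in matches:
    # Serialize the matched element back to markup for inspection.
    print(lxml.html.tostring(div))
```

For the real page, the downloaded string would be passed to `lxml.html.fromstring` instead of `sample`. Whether lxml copes with that particular page is not confirmed in this thread.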