I can't scrape anything with BeautifulSoup

Posted on 2024-11-30 08:42:53

I'm using BeautifulSoup to scrape some web content.

I'm learning from this example code, but I always get None as the result.

Code:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html').read())

post = soup.find('div', attrs={'id': 'topmenucontainer'})

print post

Any idea what I'm doing wrong?

Thanks!!

情魔剑神 2024-12-07 08:42:53

I don't think you are doing anything wrong.

It is the second script tag that is confusing BeautifulSoup. The tag looks like this:

<script type='text/javascript'>
<!--//--><![CDATA[//><!--
var arVersion = navigator.appVersion.split("MSIE")
var version = parseFloat(arVersion[1])

function fixPNG(myImage) 
{
    if ((version >= 5.5) && (version < 7) && (document.body.filters)) 
    {
       var imgID = (myImage.id) ? "id='" + myImage.id + "' " : ""
       var imgClass = (myImage.className) ? "class='" + myImage.className + "' " : ""
       var imgTitle = (myImage.title) ? 
                     "title='" + myImage.title  + "' " : "title='" + myImage.alt + "' "
       var imgStyle = "display:inline-block;" + myImage.style.cssText
       var strNewHTML = "<span " + imgID + imgClass + imgTitle
                  + " style=\"" + "width:" + myImage.width 
                  + "px; height:" + myImage.height 
                  + "px;" + imgStyle + ";"
                  + "filter:progid:DXImageTransform.Microsoft.AlphaImageLoader"
                  + "(src=\'" + myImage.src + "\', sizingMethod='scale');\"></span>"
       myImage.outerHTML = strNewHTML     
    }
}
//--><!]]>
</script>

However, BeautifulSoup seems to think it is still inside a comment (or something similar) and includes the rest of the file as the content of the script tag.

Try:

print str(soup.findAll('script')[1])[:2000]

and you'll see what I mean.

If you remove the CDATA markers, you should find that the page parses correctly:

soup = BeautifulSoup(
    urllib2.urlopen('http://www.velocidadcuchara.com/2011/08/helado-platano-light.html')
    .read()
    .replace('<![CDATA[', '').replace('<!]]>', ''))
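For anyone on Python 3, here is a minimal sketch of the same cleanup step. It is only an illustration under some assumptions: strip_cdata_markers, IdFinder and the SAMPLE page are made-up names and data, and the standard library's html.parser stands in for BeautifulSoup so the snippet runs without extra installs. It does not reproduce the original parsing failure; it only shows that the markers are removed and that the target div is still found afterwards.

```python
from html.parser import HTMLParser

# The two marker strings that the answer above removes from the page.
CDATA_MARKERS = ('<![CDATA[', '<!]]>')


def strip_cdata_markers(html):
    """Hypothetical helper: return html with the CDATA markers removed."""
    for marker in CDATA_MARKERS:
        html = html.replace(marker, '')
    return html


class IdFinder(HTMLParser):
    """Hypothetical checker: records whether a div with a given id is seen."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; turn it into a dict.
        if tag == 'div' and dict(attrs).get('id') == self.wanted_id:
            self.found = True


# Made-up sample page that mimics the structure of the script tag above.
SAMPLE = """<html><head>
<script type='text/javascript'>
<!--//--><![CDATA[//><!--
var version = 1
//--><!]]>
</script>
</head><body><div id="topmenucontainer">menu</div></body></html>"""

cleaned = strip_cdata_markers(SAMPLE)

finder = IdFinder('topmenucontainer')
finder.feed(cleaned)
finder.close()
print(finder.found)
```

With a real page you would pass the cleaned string to whichever parser you actually use, in the same way the answer above passes it to BeautifulSoup.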

忆悲凉 2024-12-07 08:42:53

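Your HTML is a bit odd. BeautifulSoup does its best, but sometimes it simply cannot parse a page.

Try moving the first [...] element inside the [...] element; that might help.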

反目相谮 2024-12-07 08:42:53

You could try using the lxml library.

lxml article

from lxml.html import parse
doc = parse('http://java.sun.com').getroot()
post = doc.cssselect('div#topmenucontainer')
