为什么 Beautiful Soup 无法显示所有表中的数据？

发布于 2024-11-08 02:49:13 字数 1861 浏览 0 评论 0原文

一周前我尝试抓取维基百科页面。但我无法弄清楚为什么 Beautiful Soup 只会显示表列中的一些字符串，而其他表列则显示“无”。

注意：表列均包含数据。

我的程序将提取带有标签“description”的所有表列。我正在尝试从表中提取所有描述。

我正在抓取的网站是： http://en.wikipedia.org/wiki/Supernatural_(season_6< /a>）

这是我的代码：

from BeautifulSoup import BeautifulSoup 
import urllib
import sys
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24'


def printList(rowList):
    for row in rowList:
        print row
        print '\n'

    return

url = "http://en.wikipedia.org/wiki/Supernatural_(season_6)"

#f = urllib.urlopen(url)
#content = f.read()
#f.close

myopener = MyOpener()
page = myopener.open(url)
content = page.read()
page.close()

soup = BeautifulSoup(''.join(content))
soup.prettify()

movieList = []

rowListTitle = soup.findAll('tr', 'vevent')
print len(rowListTitle)

#printList(rowListTitle)
for row in rowListTitle:
    col = row.next # explain this?
    if col != 'None':
        col = col.findNext("b")
        movieTitle = col.string
        movieTuple = (movieTitle,'')
        movieList.append(movieTuple)

#printList(movieList)

for row in movieList:
    print row[0]

rowListDescription = soup.findAll('td' , 'description')
print len(rowListDescription)


index = 1;
while ( index < len(rowListDescription) ):
    description = rowListDescription[index]
    print description
    print description.string
    str = description
    print '####################################'
    movieList[index - 1] = (movieList[index - 1][0],description)
    index = index + 1

我没有粘贴输出，因为它真的很长，但输出确实很奇怪，因为它确实设法捕获中的信息，但是当我做一个.string，它给了我一个空内容。

原文

I tried to page scrape wikipedia a week ago. But i could not figure out why Beautiful Soup will only show some string from the table column and show "none" for other table column.

NOTE: the table column all contains data.

My program will extract all table columns with the tag "description". I am trying to extract all the description from the table.

The website I am scraping is: http://en.wikipedia.org/wiki/Supernatural_(season_6)

This is my code:

from BeautifulSoup import BeautifulSoup 
import urllib
import sys
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24'


def printList(rowList):
    for row in rowList:
        print row
        print '\n'

    return

url = "http://en.wikipedia.org/wiki/Supernatural_(season_6)"

#f = urllib.urlopen(url)
#content = f.read()
#f.close

myopener = MyOpener()
page = myopener.open(url)
content = page.read()
page.close()

soup = BeautifulSoup(''.join(content))
soup.prettify()

movieList = []

rowListTitle = soup.findAll('tr', 'vevent')
print len(rowListTitle)

#printList(rowListTitle)
for row in rowListTitle:
    col = row.next # explain this?
    if col != 'None':
        col = col.findNext("b")
        movieTitle = col.string
        movieTuple = (movieTitle,'')
        movieList.append(movieTuple)

#printList(movieList)

for row in movieList:
    print row[0]

rowListDescription = soup.findAll('td' , 'description')
print len(rowListDescription)


index = 1;
while ( index < len(rowListDescription) ):
    description = rowListDescription[index]
    print description
    print description.string
    str = description
    print '####################################'
    movieList[index - 1] = (movieList[index - 1][0],description)
    index = index + 1

I did not paste the output as it is really long. But the output is really weird as it did managed to capture the information in the <td> but when i do a .string, it gives me an empty content.

分享到QQ

分享到微博