Why can't Beautiful Soup display all the data in the tables?
A week ago I tried to scrape a Wikipedia page, but I could not figure out why Beautiful Soup shows the string for some table cells and shows "None" for others.
NOTE: all of the table cells contain data.
My program extracts every table cell with the class "description"; I am trying to pull all of the descriptions out of the table.
The website I am scraping is: http://en.wikipedia.org/wiki/Supernatural_(season_6)
This is my code:
from BeautifulSoup import BeautifulSoup
import urllib
import sys
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.65 Safari/534.24'

def printList(rowList):
    for row in rowList:
        print row
        print '\n'
    return

url = "http://en.wikipedia.org/wiki/Supernatural_(season_6)"

#f = urllib.urlopen(url)
#content = f.read()
#f.close()

myopener = MyOpener()
page = myopener.open(url)
content = page.read()
page.close()

soup = BeautifulSoup(''.join(content))
soup.prettify()

movieList = []
rowListTitle = soup.findAll('tr', 'vevent')
print len(rowListTitle)
#printList(rowListTitle)

for row in rowListTitle:
    col = row.next  # explain this?
    if col is not None:
        col = col.findNext("b")
        movieTitle = col.string
        movieTuple = (movieTitle, '')
        movieList.append(movieTuple)

#printList(movieList)
for row in movieList:
    print row[0]

rowListDescription = soup.findAll('td', 'description')
print len(rowListDescription)

index = 1
while index < len(rowListDescription):
    description = rowListDescription[index]
    print description
    print description.string
    print '####################################'
    movieList[index - 1] = (movieList[index - 1][0], description)
    index = index + 1
I did not paste the output because it is really long, but it is strange: the script does capture the information inside the <td>, yet when I call .string on it I get empty content.
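To show what I mean without pasting the full dump, here is a stripped-down, standalone example (the cell markup is made up, but it is shaped like the description cells on the page):

from BeautifulSoup import BeautifulSoup

# A hand-written cell shaped like the ones on the page (illustrative only).
html = '<table><tr><td class="description">Dean tries to <a href="/wiki/Example">rescue Sam</a> from Hell.</td></tr></table>'
cell = BeautifulSoup(html).find('td')

print cell          # prints the whole <td>...</td> markup
print cell.string   # prints None for a cell like this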
Answer:
Do all the description strings come up empty? According to the documentation, .string is only set when a tag has exactly one child node and that child node is a string. In this case the description cells often have child nodes, e.g. an <a> link to another Wikipedia article. That counts as a non-string child node, so .string for the description node is set to None.
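If you want the text of a cell regardless of the tags inside it, one option is to join all of the text nodes under the cell rather than relying on .string. A minimal sketch, assuming BeautifulSoup 3 as in your imports (the cell markup below is a made-up stand-in so the snippet runs on its own):

from BeautifulSoup import BeautifulSoup

# Stand-in markup shaped like the page's description cells (illustrative only).
html = '<table><tr><td class="description">Dean tries to <a href="/wiki/Example">rescue Sam</a> from Hell.</td></tr></table>'
soup = BeautifulSoup(html)

for description in soup.findAll('td', 'description'):
    # findAll(text=True) collects every text node under the cell, so this
    # works whether or not the cell contains child tags such as <a> links.
    text = ''.join(description.findAll(text=True))
    print text.strip()

Run against the soup object you build from the real page, the same loop should print the full sentence for each description instead of None.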