AttributeError: 'NoneType' object has no attribute 'strip'; using Python WebCrawler
I'm writing a Python program to crawl Twitter using a combination of urllib2, the python-twitter wrapper for the API, and BeautifulSoup. However, when I run my program, I get an error of the following type:
ray_krueger
RafaelNadal
Traceback (most recent call last):
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module>
crawl(start_follower, output, depth)
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
crawl(y, output, in_depth - 1)
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
crawl(y, output, in_depth - 1)
File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 64, in crawl
request = urllib2.Request(new_url)
File "C:\Python28\lib\urllib2.py", line 192, in __init__
self.__original = unwrap(url)
File "C:\Python28\lib\urllib.py", line 1038, in unwrap
url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'
I'm completely unfamiliar with this type of error (I'm new to Python), and searching for it online has yielded very little information. I've attached my code as well; do you have any suggestions?
Thanx
Snehizzy
import twitter
import urllib
import urllib2
import htmllib
from BeautifulSoup import BeautifulSoup
import re

start_follower = "NYTimeskrugman"
depth = 3
output = open(r'C:\Python27\outputtest.txt', 'a') #better to use SQL database than this
api = twitter.Api()
#want to also begin entire crawl with some sort of authentication service

def site(follower):
    followersite = "http://mobile.twitter.com/" + follower
    return followersite

def getPage(follower):
    thisfollowersite = site(follower)
    request = urllib2.Request(thisfollowersite)
    response = urllib2.urlopen(request)
    return response

def getSoup(response):
    html = response.read()
    soup = BeautifulSoup(html)
    return soup

def get_more_tweets(soup):
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' + c
            return d

def recordlinks(soup, output):
    tags = soup.findAll('div', {'class' : "list-tweet"}) #to obtain tweet of a follower
    for tag in tags:
        a = tag.renderContents()
        b = str(a)
        output.write(b)
        output.write('\n\n')

def checkforstamp(soup):
    times = nsoup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        if str(stamp) == '3 months ago':
            return True

def crawl(follower, output, in_depth):
    if in_depth > 0:
        output.write(follower)
        a = getPage(follower)
        new_soup = getSoup(a)
        recordlinks(new_soup, output)
        currenttime = False
        while currenttime == False:
            new_url = get_more_tweets(new_soup)
            request = urllib2.Request(new_url)
            response = urllib2.urlopen(request)
            new_soup = getSoup(response)
            recordlinks(new_soup, output)
            currenttime = checkforstamp(new_soup)
        users = api.GetFriends(follower)
        for u in users[0:5]:
            x = u.screen_name
            y = str(x)
            print y
            crawl(y, output, in_depth - 1)
        output.write('\n\n')
    output.write('\n\n\n')

crawl(start_follower, output, depth)
print("Program done. Look at output file.")
4 Answers
It means exactly what it says: url.strip() requires first figuring out what url.strip is, i.e. looking up the strip attribute of url. This failed because url is a 'NoneType' object, i.e. an object whose type is NoneType, i.e. the special object None.

Presumably url was expected to be a str, i.e. a text string, since those do have a strip attribute.

This happened within File "C:\Python28\lib\urllib.py", i.e. the urllib module. That's not your code, so we look backwards through the exception trace until we find something we wrote: request = urllib2.Request(new_url). We can only presume that the new_url we pass to the urllib2 module eventually becomes a url variable somewhere within urllib.

So where did new_url come from? We look up the line of code in question (notice that there is a line number in the exception traceback), and we see that the immediately preceding line is new_url = get_more_tweets(new_soup), so we're using the result of get_more_tweets.

An analysis of this function shows that it searches through some links, tries to find one labelled 'more', and gives us the URL of the first such link it finds. The case we haven't considered is when there are no such links. In that case, the function simply reaches its end and implicitly returns None (that's how Python handles functions that reach the end without an explicit return, since Python has no return-type declarations and a value must always be returned), which is where that value is coming from.

Presumably, if there is no 'more' link, then we should not be attempting to follow a link at all. Therefore, we fix the error by explicitly checking for this None return value and skipping the urllib2.Request in that case, since there is no link to follow.

By the way, this None value would be a more idiomatic "placeholder" value for the not-yet-determined currenttime than the False value you are currently using. You might also consider being a little more consistent about separating words with underscores in your variable and method names, to make things easier to read. :)
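A minimal sketch of the fix this answer describes, assuming the helper functions from the question and keeping its Python 2 / urllib2 style (only the paging loop inside crawl() is shown):

def crawl(follower, output, in_depth):
    if in_depth > 0:
        output.write(follower)
        new_soup = getSoup(getPage(follower))
        recordlinks(new_soup, output)
        currenttime = None              # None as the "not yet determined" placeholder
        while not currenttime:
            new_url = get_more_tweets(new_soup)
            if new_url is None:         # no 'more' link on this page: nothing left to follow
                break
            response = urllib2.urlopen(urllib2.Request(new_url))
            new_soup = getSoup(response)
            recordlinks(new_soup, output)
            currenttime = checkforstamp(new_soup)
        # ... then api.GetFriends(follower) and the recursive calls, unchanged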
When you call urllib2.Request(new_url) in crawl(), new_url is None. As you're getting new_url from get_more_tweets(new_soup), that means get_more_tweets() is returning None.

That means return d is never being reached, which means either str(b) == 'more' was never true, or soup.findAll() didn't return any links, so for link in links does nothing.
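To see which of those two cases you are hitting, a quick diagnostic helper along these lines can help (this is an editor's illustration, not part of the original answer; it reuses the exact findAll call from the question):

def debug_more_link(soup):
    # Print every anchor the question's query matches and what it renders as,
    # so you can see whether any of them is actually the text 'more'.
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    print 'found %d candidate links' % len(links)
    for link in links:
        print repr(str(link.renderContents())), link['href']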
When you are doing request = urllib2.Request(new_url), new_url is supposed to be a string, but this error says it's None.

You get new_url's value from the get_more_tweets function, so it returned None somewhere.

When we look at this code, the function only returns when str(b) == "more" for some link, so your real question is: why does str(b) == "more" never happen?
You're passing None rather than a string to urllib2.Request(). Looking at the code, this means that new_url is sometimes None. And looking at your get_more_tweets() function (quoted in the question above), which is the source of this variable, we see that it returns a value only if b is "more", because your return statement is indented under your if. If b is equal to any other value, no value (i.e. None) is returned.

You need to either always return a valid URL here, or check for the None return value before passing it to urllib2.Request().
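As an illustration of making that explicit (an editor's sketch in the same BeautifulSoup 3 style as the question, not the answerer's own snippet), get_more_tweets() can return None deliberately instead of falling off the end, and the caller can then test for it as shown under the first answer:

def get_more_tweets(soup):
    # Same idea as the question's version, but with an explicit return for the
    # case where no 'more' link exists, instead of silently falling off the end.
    for link in soup.findAll('a', {'href': True}):
        if str(link.renderContents()) == 'more':
            return 'http://mobile.twitter.com' + link['href']
    return None  # no 'more' link on this page

Either way, crawl() must not pass the result straight to urllib2.Request() without checking it first.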