AttributeError: 'NoneType' object has no attribute 'strip' with a Python web crawler

Posted on 2024-11-27 17:33:09

I'm writing a python program to crawl twitter using a combination of urllib2, the python twitter wrapper for the api, and BeautifulSoup. However, when I run my program, I get an error of the following type:

ray_krueger
RafaelNadal

Traceback (most recent call last):
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module>
    crawl(start_follower, output, depth)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
    crawl(y, output, in_depth - 1)
  File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 64, in crawl
    request = urllib2.Request(new_url)
  File "C:\Python28\lib\urllib2.py", line 192, in __init__
    self.__original = unwrap(url)
  File "C:\Python28\lib\urllib.py", line 1038, in unwrap
    url = url.strip()
AttributeError: 'NoneType' object has no attribute 'strip'

I'm completely unfamiliar with this type of error (new to python) and searching for it online has yielded very little information. I've attached my code as well, but do you have any suggestions?

Thanx
Snehizzy

import twitter
import urllib
import urllib2
import htmllib
from BeautifulSoup import BeautifulSoup
import re

start_follower = "NYTimeskrugman" 
depth = 3
output = open(r'C:\Python27\outputtest.txt', 'a') #better to use SQL database thanthis

api = twitter.Api()

#want to also begin entire crawl with some sort of authentication service 

def site(follower):
    followersite = "http://mobile.twitter.com/" + follower
    return followersite

def getPage(follower): 
    thisfollowersite = site(follower)
    request = urllib2.Request(thisfollowersite)
    response = urllib2.urlopen(request)
    return response

def getSoup(response): 
    html = response.read()
    soup = BeautifulSoup(html)
    return soup

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def recordlinks(soup,output):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        a = tag.renderContents()
        b = str (a)
        output.write(b)
        output.write('\n\n')

def checkforstamp(soup):
    times = nsoup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        if str(stamp) == '3 months ago':
            return True

def crawl(follower, output, in_depth):
    if in_depth > 0:
        output.write(follower)
        a = getPage(follower)
        new_soup = getSoup(a)
        recordlinks(new_soup, output)
        currenttime = False 
        while currenttime == False:
            new_url = get_more_tweets(new_soup)
            request = urllib2.Request(new_url)
            response = urllib2.urlopen(request)
            new_soup = getSoup(response)
            recordlinks(new_soup, output)
            currenttime = checkforstamp(new_soup)
        users = api.GetFriends(follower)
        for u in users[0:5]:
            x = u.screen_name 
            y = str(x)
            print y
            crawl(y, output, in_depth - 1)
            output.write('\n\n')
        output.write('\n\n\n')

crawl(start_follower, output, depth)
print("Program done. Look at output file.")

Comments (4)

撩发小公举 2024-12-04 17:33:09

AttributeError: 'NoneType' object has no attribute 'strip'

It means exactly what it says: url.strip() requires first figuring out what url.strip is, i.e. looking up the strip attribute of url. This failed because url is a 'NoneType' object, i.e. an object whose type is NoneType, i.e. the special object None.

Presumably url was expected to be a str, i.e. a text string, since those do have a strip attribute.
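
As a minimal illustration of the same failure outside the crawler (a sketch only; the variable names are made up for this example and nothing here is taken from urllib itself), calling a string method on None raises this kind of error:

```python
# Hypothetical stand-in for the value that ended up inside urllib's unwrap().
url = None

try:
    url.strip()          # None has no string methods, so this raises
except AttributeError as exc:
    message = str(exc)   # "'NoneType' object has no attribute 'strip'"

# The same call works once the value really is a string.
cleaned = " http://mobile.twitter.com/ ".strip()
```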

This happened within File "C:\Python28\lib\urllib.py", i.e., the urllib module. That's not your code, so we look backwards through the exception trace until we find something we wrote: request = urllib2.Request(new_url). We can only presume that the new_url that we pass to the urllib2 module eventually becomes a url variable somewhere within urllib.

So where did new_url come from? We look up the line of code in question (notice that there is a line number in the exception traceback), and we see that the immediately preceding line is new_url = get_more_tweets(new_soup), so we're using the result of get_more_tweets.

An analysis of this function shows that it searches through some links, tries to find one labelled 'more', and gives us the URL for the first such link that it finds. The case we haven't considered is when there are no such links. In this case, the function just reaches the end, and implicitly returns None (that's how Python handles functions that reach the end without an explicit return, since there is no specification of a return type in Python and since a value must always be returned), which is where that value is coming from.
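
The implicit None can be seen in a small sketch. It assumes a simplified stand-in for get_more_tweets: find_more_link is a hypothetical name, and it takes a plain list of (label, href) pairs instead of the tags that soup.findAll() would return.

```python
def find_more_link(links):
    """Return the full URL of the first link labelled 'more'.

    `links` is a hypothetical list of (label, href) pairs standing in
    for the tags that soup.findAll() would return.
    """
    for label, href in links:
        if label == 'more':
            return 'http://mobile.twitter.com' + href
    # No explicit return here: falling off the end yields None.


found = find_more_link([('home', '/'), ('more', '/user?page=2')])
missing = find_more_link([('home', '/')])
```

Here found is a URL string, while missing is None, which is the value that later reaches urllib2.Request in the question.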

Presumably, if there is no 'more' link, then we should not be attempting to follow the link at all. Therefore, we fix the error by explicitly checking for this None return value, and skipping the urllib2.Request in that case, since there is no link to follow.
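
One way to sketch that check is shown here. It is not the crawler's actual code: follow_more_links, get_next_url, fetch and max_pages are names invented for the example, and an in-memory dictionary replaces the real HTTP requests so the snippet runs on its own.

```python
def follow_more_links(first_page, get_next_url, fetch, max_pages=10):
    """Collect pages by following 'more' links until none is left.

    get_next_url(page) returns a URL string or None; fetch(url) returns
    the next page. Both are hypothetical callables supplied by the caller.
    max_pages is an assumed safety limit, not part of the original code.
    """
    pages = [first_page]
    page = first_page
    while len(pages) < max_pages:
        next_url = get_next_url(page)
        if next_url is None:      # no 'more' link: stop instead of requesting None
            break
        page = fetch(next_url)
        pages.append(page)
    return pages


# Tiny in-memory "site": each page maps to its 'more' URL, or None.
site = {'p1': 'u2', 'p2': 'u3', 'p3': None}
by_url = {'u2': 'p2', 'u3': 'p3'}
pages = follow_more_links('p1', site.get, by_url.get)
```

In the question's crawl() function, the equivalent change would be to test new_url right after calling get_more_tweets(new_soup) and leave the while loop when it is None.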

By the way, this None value would be a more idiomatic "placeholder" value for the not-yet-determined currenttime than the False value that you are currently using. You might also consider being a little more consistent about separating words with underscores in your variable and method names to make things easier to read. :)
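
A small sketch of the placeholder idea, under the assumption that the timestamp check is rewritten to always return a bool (reached_cutoff is a hypothetical replacement for checkforstamp and works on a plain list of strings):

```python
def reached_cutoff(stamps, cutoff='3 months ago'):
    """Always return a bool, so callers never see an implicit None."""
    return cutoff in stamps


# None and False do not compare equal, so a loop written as
# `while flag == False` stops as soon as flag becomes None.
none_equals_false = (None == False)

done = None                 # None as the "not determined yet" placeholder
checked = 0
for stamps in (['1 day ago'], ['2 months ago', '3 months ago'], ['1 year ago']):
    if done:
        break
    done = reached_cutoff(stamps)
    checked += 1
```

Because checkforstamp in the question returns either True or an implicit None, this comparison behaviour may be worth keeping in mind when reading its while loop.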

迷途知返 2024-12-04 17:33:09

When you do

request = urllib2.Request(new_url)

in crawl(), new_url is None. As you're getting new_url from get_more_tweets(new_soup), that means get_more_tweets() is returning None.

That means return d is never being reached, which means either str(b) == 'more' was never true, or soup.findAll() didn't return any links so for link in links does nothing.
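
To tell those two cases apart while debugging, something along these lines could help. It is only a sketch: explain_missing_more is a made-up helper, and it inspects a plain list of (label, href) pairs rather than real BeautifulSoup tags.

```python
def explain_missing_more(links):
    """Say why no 'more' link was found.

    `links` is a hypothetical list of (label, href) pairs standing in
    for the result of soup.findAll().
    """
    if not links:
        return 'findAll() returned no links'
    labels = [label for label, _ in links]
    if 'more' not in labels:
        return "no label equals 'more'; labels seen: %r" % (labels,)
    return "a 'more' link exists"
```

Printing the labels that were actually seen makes it easier to spot differences such as extra whitespace or different capitalisation in the link text.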

勿忘心安 2024-12-04 17:33:09

When you are doing: request = urllib2.Request(new_url), new_url is supposed to be a string, but this error says it's None.

You get new_url's value from the get_more_tweets function, so that function returned None somewhere.

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

When we look at this code, the function returns a value only when str(b) == "more" for some link, so your real question is "Why does str(b) == "more" never happen?".

漫雪独思 2024-12-04 17:33:09

You're passing None rather than a string to urllib2.Request(). Looking at the code, this means that new_url is None sometimes. And looking at your get_more_tweets() function, which is the source of this variable, we see this:

def get_more_tweets(soup): 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

This function is returning a value only if b is "more" because your return statement is indented under your if. If it is equal to any other value, no value (i.e. None) is returned.

You need to either always return a valid URL here, or you need to check for the None return value before passing it to urllib2.Request().
