AttributeError: 'NoneType' object has no attribute 'extract'

Posted 2025-02-03 00:10:29


I'm trying to exclude a div and a nav from a page. The first iteration seems to run fine, but then it throws an error.

From this page: https://www.velkesvatonovice.cz/windex.php/rubrika/elektronicka-uredni-deska/

"Exclude" code from: https://discuss.dizzycoding.com/exclude-unwanted-tag-on-beautifulsoup-python/

I'm trying to get the text of an article (which appears, for example, in the 5th article), but not the attachments () and the nav.

Console log:

PS C:\Users\thoma\Desktop\py\velkesvatonovice.cz\scripts> python main.py
Traceback (most recent call last):
  File "main.py", line 53, in <module>
    unwantedAttachments.extract()
AttributeError: 'NoneType' object has no attribute 'extract'

Problematic part of the code:

#Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        unwantedAttachments.extract()
        unwantedNav = artcontent.find('nav')
        unwantedNav.extract()
        print(artcontent)
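The traceback points at the first `.extract()` call: when an article page has no `<div class="attachments">`, `find()` returns `None`, and calling any method on `None` raises this error. A minimal sketch reproducing it (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

# An article body with no attachments div -- as on some pages of the site.
html = "<div class='entry-inner'><p>Article text only.</p></div>"
artcontent = BeautifulSoup(html, "html.parser").find("div", class_="entry-inner")

# find() returns None when nothing matches...
print(artcontent.find("div", class_="attachments"))  # None

# ...so chaining .extract() onto that result raises AttributeError.
```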

Full code:

from bs4 import BeautifulSoup
import requests
import re
from csv import writer

pageno=1
url= "https://www.velkesvatonovice.cz/windex.php/rubrika/elektronicka-uredni-deska/page/"+str(pageno)+"/"
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")
lists = soup.find_all("article")

#65

def normalize(str):
    return(re.sub(r'\xa0', ' ', str))

with open("listings.csv", "w", encoding="utf8") as f:
    thewriter = writer(f)
    header= ["Name", "Text", "Text full" ,"Attachments" , "Category", "Category full", "Date", "URL", "Page"]
    thewriter.writerow(header)

    for list in lists:
        categorieslist=list.find_all("a", rel="category tag")

        #Name
        article=list.find("a", rel="bookmark").text.strip()
        
        #Text
        text=list.find("div", class_="entry excerpt entry-summary").text

        #Category
        category = (categorieslist[len(categorieslist)-1])
        
        #Category full
        categories=""
        for cat in categorieslist:
            categories += (cat.text + "/") 

        #Date
        date=list.find("time").text
        
        #URL
        urlarticle=list.find("a", rel="bookmark")["href"]

        pageart = requests.get(urlarticle)
        soupart = BeautifulSoup(pageart.content, "html.parser")
        artcontent = soupart.find("div", class_="entry-inner")
        
        #Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        unwantedAttachments.extract()
        unwantedNav = artcontent.find('nav')
        unwantedNav.extract()
        print(artcontent)

        #Attachments

        #Page

        item = [normalize(article), normalize(text), "ss", "Attachment", category.text, categories, date, urlarticle]
        thewriter.writerow(item)


Comments (1)

不醒的梦 2025-02-10 00:10:29

A simple "if" fixes the whole problem. Thx @Ahmad for pointing it out.

#Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        if unwantedAttachments:
            unwantedAttachments.extract()
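For completeness, the same guard applies to the nav, since either element may be missing from a given article. A runnable sketch (the HTML snippet is made up; the selectors match the question's code):

```python
from bs4 import BeautifulSoup

# Sample article body: has a nav but no attachments div.
html = "<div class='entry-inner'><p>Text.</p><nav>pages</nav></div>"
artcontent = BeautifulSoup(html, "html.parser").find("div", class_="entry-inner")

# Guard each optional element before removing it.
for unwanted in (artcontent.find("div", class_="attachments"),
                 artcontent.find("nav")):
    if unwanted:
        unwanted.extract()

print(artcontent.get_text(strip=True))  # -> Text.
```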