AttributeError: 'NoneType' object has no attribute 'extract'

Posted 2025-02-03 00:10:29


I'm trying to exclude a div and a nav from a page. The first iteration seems to run fine, but then it throws an error.

From this page: https://www.velkesvatonovice.cz/windex.php/rubrika/elektronicka-uredni-deska/

"Exclude" code from: https://discuss.dizzycoding.com/exclude-unwanted-tag-on-beautifulsoup-python/

I'm trying to get the text of an article (which appears, for example, in the 5th article), but not the attachments () and the nav.

Console log:

PS C:\Users\thoma\Desktop\py\velkesvatonovice.cz\scripts> python main.py
Traceback (most recent call last):
  File "main.py", line 53, in <module>
    unwantedAttachments.extract()
AttributeError: 'NoneType' object has no attribute 'extract'

Problematic part of the code:

#Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        unwantedAttachments.extract()
        unwantedNav = artcontent.find('nav')
        unwantedNav.extract()
        print(artcontent)
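The traceback points at the first `.extract()` call: when an article page has no `<div class="attachments">`, `find()` returns `None`, and calling any method on `None` raises this error. A minimal sketch reproducing it (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

# An article body with no attachments div -- as on some pages of the site.
html = "<div class='entry-inner'><p>Article text only.</p></div>"
artcontent = BeautifulSoup(html, "html.parser").find("div", class_="entry-inner")

# find() returns None when nothing matches...
print(artcontent.find("div", class_="attachments"))  # None

# ...so chaining .extract() onto that result raises AttributeError.
```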

Full code:

from bs4 import BeautifulSoup
import requests
import re
from csv import writer

pageno=1
url= "https://www.velkesvatonovice.cz/windex.php/rubrika/elektronicka-uredni-deska/page/"+str(pageno)+"/"
page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")
lists = soup.find_all("article")

#65

def normalize(str):
    return(re.sub(r'\xa0', ' ', str))

with open("listings.csv", "w", encoding="utf8") as f:
    thewriter = writer(f)
    header= ["Name", "Text", "Text full" ,"Attachments" , "Category", "Category full", "Date", "URL", "Page"]
    thewriter.writerow(header)

    for list in lists:
        categorieslist=list.find_all("a", rel="category tag")

        #Name
        article=list.find("a", rel="bookmark").text.strip()
        
        #Text
        text=list.find("div", class_="entry excerpt entry-summary").text

        #Category
        category = (categorieslist[len(categorieslist)-1])
        
        #Category full
        categories=""
        for cat in categorieslist:
            categories += (cat.text + "/") 

        #Date
        date=list.find("time").text
        
        #URL
        urlarticle=list.find("a", rel="bookmark")["href"]

        pageart = requests.get(urlarticle)
        soupart = BeautifulSoup(pageart.content, "html.parser")
        artcontent = soupart.find("div", class_="entry-inner")
        
        #Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        unwantedAttachments.extract()
        unwantedNav = artcontent.find('nav')
        unwantedNav.extract()
        print(artcontent)

        #Attachments

        #Page

        item = [normalize(article), normalize(text), "ss", "Attachment", category.text, categories, date, urlarticle]
        thewriter.writerow(item)


Comments (1)

不醒的梦 2025-02-10 00:10:29

A simple "if" fixes the whole problem. Thx @Ahmad for pointing it out.

#Text full
        unwantedAttachments = artcontent.find('div', class_="attachments")
        if unwantedAttachments:
            unwantedAttachments.extract()
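For completeness, the same guard applies to the nav, since either element may be missing from a given article. A runnable sketch (the HTML snippet is made up; the selectors match the question's code):

```python
from bs4 import BeautifulSoup

# Sample article body: has a nav but no attachments div.
html = "<div class='entry-inner'><p>Text.</p><nav>pages</nav></div>"
artcontent = BeautifulSoup(html, "html.parser").find("div", class_="entry-inner")

# Guard each optional element before removing it.
for unwanted in (artcontent.find("div", class_="attachments"),
                 artcontent.find("nav")):
    if unwanted:
        unwanted.extract()

print(artcontent.get_text(strip=True))  # -> Text.
```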