AttributeError when trying to scrape a webpage

Published 2025-02-07 05:11:10


I'm currently working on a silly project to test my Python skills. I want to be able to scrape data for a lottery game and analyze it to measure a pattern.

The goals:

  • Scrape the numbers history, including dates.
  • Store the data in a text file or CSV.
  • Analyze the data and have the computer make an educated guess.
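The storage goal can be sketched with the standard library's csv module. The file name, column names, and sample draws below are placeholders, not scraped data:

```python
import csv

# Hypothetical draw history: (date, numbers) pairs as they might come
# out of the scraper.
draws = [
    ("2023-01-04", [5, 12, 23, 31, 44]),
    ("2023-01-07", [2, 9, 18, 27, 40]),
]

with open("draws.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "numbers"])  # header row
    for date, numbers in draws:
        # Join the numbers so each draw occupies a single row.
        writer.writerow([date, " ".join(map(str, numbers))])
```

One row per draw makes the later analysis step a simple matter of reading the file back with csv.reader.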

I first attempted to scrape the data by downloading the HTML file directly and reading it within the program:

from bs4 import BeautifulSoup


# Establishes soup and html data
with open('lotto_test.html', 'r') as html_file:
    content = html_file.read()
    soup = BeautifulSoup(content, 'lxml')

# defines where the data will be written
file = open('Numbers.txt', 'w')

# finds data in class, prints and writes to file
for ul in soup.find_all('ul', class_='resultsNums'):
    numbers = ul.text
    print(numbers)
    file.write(numbers)

# same as ul, except for the date
for dates in soup.find_all('div', class_='resultsDrawDate'):
    dod = dates.text
    print(dod)
    file.write(dod)

file.flush()
file.close()

Though the data is formatted a bit strangely, this successfully accomplished what I'm looking for. However, I would like to gather as much data as possible.
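One way to tidy the formatting is to pair each draw's numbers with its date before writing anything, for example with zip. A minimal sketch, using placeholder strings where the real script has the stripped .text values from the two find_all() loops:

```python
# Stand-ins for the .text of each resultsNums <ul> and each
# resultsDrawDate <div>; in the real script these come from the
# two find_all() loops.
numbers = ["5 12 23 31 44", "2 9 18 27 40"]
dates = ["Jan 4, 2023", "Jan 7, 2023"]

with open("Numbers.txt", "w") as out:
    # zip() pairs the i-th date with the i-th set of numbers, so each
    # drawing lands on one line instead of two separate dumps.
    for date, nums in zip(dates, numbers):
        out.write(f"{date}: {nums}\n")
```

This assumes the page lists numbers and dates in the same order, which is worth verifying against the actual HTML.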

I am trying to approach this differently, by having the computer scrape the data straight from the webpage, then go page by page and store data from a year's worth of drawings. I have written the script this way:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

s = HTMLSession()
url = 'https://www.lotterypost.com/game/98/results'

def get_data(url):
    r = s.get(url)
    r.html.render(sleep=1)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    return soup
    
def next_page(soup):
    page = soup.find('div', {"class":"CSR-Paging"})
    if not page.find('span', {"class":"CSR-PrevNext"}):
        url = "https://www.lotterypost.com/game/98/results" + str(page.find('div', {'class':'CSR-Paging'}).find('a')['href'])
        return url
    else:
        return

while True:
    soup = get_data(url)
    url = next_page(soup)
    if not url:
        break
    print(url)

When run, the output is as follows:

Traceback (most recent call last):
  File "/Volumes/Local NAS/Cross Platform Files/Lotto Calc/scraper.py", line 23, in <module>
    url = next_page(soup)
  File "/Volumes/Local NAS/Cross Platform Files/Lotto Calc/scraper.py", line 15, in next_page
    if not page.find('span', {"class":"CSR-PrevNext"}):
AttributeError: 'NoneType' object has no attribute 'find'

I'm not sure what to do from here. I thought .find() was a built-in function of Python. Does this mean it is not finding the attribute I specified, or am I confused about what attribute I am looking for exactly? Thanks in advance.


久夏青 2025-02-14 05:11:10


Your page variable is None, so you can't call .find() on it. You should add a pre-check to that conditional block to fail early when this is the case.

I have separated out these conditional blocks in an attempt to make this more clear. Also, I have over-commented this for clarity.

def next_page(soup):
    # Retrieve the `CSR-Paging` element.
    page = soup.find("div", {"class": "CSR-Paging"})

    # We couldn't find a `CSR-Paging` element, so return `None`.
    if not page:
        return

    # We found a `CSR-PrevNext` element, so return `None`.
    if page.find("span", {"class": "CSR-PrevNext"}):
        return

    # Retrieve the `CSR-Paging` element's nearest anchor.
    path = str(page.find("div", {"class": "CSR-Paging"}).find("a")["href"])

    # Return the path appended to the page constant.
    return "https://www.lotterypost.com/game/98/results" + path
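The same return-None-on-no-match behaviour exists in the standard library's ElementTree, which makes it easy to demonstrate the guard-clause pattern without any third-party packages. The markup below is a made-up stand-in for the real page, not its actual structure:

```python
import xml.etree.ElementTree as ET

def next_link(root):
    # Guard clause 1: no paging container at all.
    paging = root.find(".//div[@class='CSR-Paging']")
    if paging is None:
        return None
    # Guard clause 2: a container with no anchor inside it.
    a = paging.find(".//a")
    if a is None:
        return None
    return a.get("href")

# A page with no paging div: find() returns None instead of raising,
# which is exactly the unchecked case that produced the AttributeError.
no_paging = ET.fromstring("<html><p>no results</p></html>")
assert no_paging.find(".//div[@class='CSR-Paging']") is None
assert next_link(no_paging) is None

# A page with a paging div and a next-page anchor.
paged = ET.fromstring(
    "<html><div class='CSR-Paging'><a href='/page/2'>2</a></div></html>"
)
assert next_link(paged) == "/page/2"
```

BeautifulSoup's Tag.find() behaves the same way, returning None when nothing matches, so every chained .find() call is a place where a guard clause may be needed.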