AttributeError when trying to scrape a webpage

Published 2025-02-07 05:11:10


I'm currently working on a silly project to test my Python skills. I want to be able to scrape data for a lottery game and analyze it to measure a pattern.

The goals:

  • Scrape the numbers history, including dates.
  • Store the data in a text file or CSV.
  • Analyze the data and have the computer make an educated guess.
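The storage goal can be sketched with the standard library's csv module. The file name, column names, and sample draws below are placeholders, not scraped data:

```python
import csv

# Hypothetical draw history: (date, numbers) pairs as they might come
# out of the scraper.
draws = [
    ("2023-01-04", [5, 12, 23, 31, 44]),
    ("2023-01-07", [2, 9, 18, 27, 40]),
]

with open("draws.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "numbers"])  # header row
    for date, numbers in draws:
        # Join the numbers so each draw occupies a single row.
        writer.writerow([date, " ".join(map(str, numbers))])
```

One row per draw makes the later analysis step a simple matter of reading the file back with csv.reader.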

I first attempted to scrape the data by downloading the HTML file directly and reading it within the program:

from bs4 import BeautifulSoup


# Establishes soup and html data
with open('lotto_test.html', 'r') as html_file:
    content = html_file.read()
    soup = BeautifulSoup(content, 'lxml')

# defines where the data will be written
file = open('Numbers.txt', 'w')

# finds data in class, prints and writes to file
for ul in soup.find_all('ul', class_='resultsNums'):
    numbers = ul.text
    print(numbers)
    file.write(numbers)

# same as ul, except for the date
for dates in soup.find_all('div', class_='resultsDrawDate'):
    dod = dates.text
    print(dod)
    file.write(dod)

file.flush()
file.close()

Though the data is formatted a bit strangely, this successfully accomplished what I'm looking for. However, I would like to gather as much data as possible.
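One way to tidy the formatting is to pair each draw's numbers with its date before writing anything, for example with zip. A minimal sketch, using placeholder strings where the real script has the stripped .text values from the two find_all() loops:

```python
# Stand-ins for the .text of each resultsNums <ul> and each
# resultsDrawDate <div>; in the real script these come from the
# two find_all() loops.
numbers = ["5 12 23 31 44", "2 9 18 27 40"]
dates = ["Jan 4, 2023", "Jan 7, 2023"]

with open("Numbers.txt", "w") as out:
    # zip() pairs the i-th date with the i-th set of numbers, so each
    # drawing lands on one line instead of two separate dumps.
    for date, nums in zip(dates, numbers):
        out.write(f"{date}: {nums}\n")
```

This assumes the page lists numbers and dates in the same order, which is worth verifying against the actual HTML.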

I am trying to approach this differently, by having the computer scrape the data straight from the webpage, then go page by page and store data from a year's worth of drawings. I have written the script this way:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

s = HTMLSession()
url = 'https://www.lotterypost.com/game/98/results'

def get_data(url):
    r = s.get(url)
    r.html.render(sleep=1)
    soup = BeautifulSoup(r.html.html, 'html.parser')
    return soup
    
def next_page(soup):
    page = soup.find('div', {"class":"CSR-Paging"})
    if not page.find('span', {"class":"CSR-PrevNext"}):
        url = "https://www.lotterypost.com/game/98/results" + str(page.find('div', {'class':'CSR-Paging'}).find('a')['href'])
        return url
    else:
        return

while True:
    soup = get_data(url)
    url = next_page(soup)
    if not url:
        break
    print(url)

When run, the output is as follows:

Traceback (most recent call last):
  File "/Volumes/Local NAS/Cross Platform Files/Lotto Calc/scraper.py", line 23, in <module>
    url = next_page(soup)
  File "/Volumes/Local NAS/Cross Platform Files/Lotto Calc/scraper.py", line 15, in next_page
    if not page.find('span', {"class":"CSR-PrevNext"}):
AttributeError: 'NoneType' object has no attribute 'find'

I'm not sure what to do from here. I thought .find() was a built-in function of Python. Does this mean it is not finding the attribute I specified, or am I confused about what attribute I am looking for exactly? Thanks in advance.


久夏青 2025-02-14 05:11:10


Your page variable is None, so you can't call .find() on it. You should add a pre-check to that conditional block to fail early when this is the case.

I have separated out these conditional blocks in an attempt to make this more clear. Also, I have over-commented this for clarity.

def next_page(soup):
    # Retrieve the `CSR-Paging` element.
    page = soup.find("div", {"class": "CSR-Paging"})

    # We couldn't find a `CSR-Paging` element, so return `None`.
    if not page:
        return

    # We found a `CSR-PrevNext` element, so return `None`.
    if page.find("span", {"class": "CSR-PrevNext"}):
        return

    # Retrieve the `CSR-Paging` element's nearest anchor.
    path = str(page.find("div", {"class": "CSR-Paging"}).find("a")["href"])

    # Return the path appended to the page constant.
    return "https://www.lotterypost.com/game/98/results" + path
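The same return-None-on-no-match behaviour exists in the standard library's ElementTree, which makes it easy to demonstrate the guard-clause pattern without any third-party packages. The markup below is a made-up stand-in for the real page, not its actual structure:

```python
import xml.etree.ElementTree as ET

def next_link(root):
    # Guard clause 1: no paging container at all.
    paging = root.find(".//div[@class='CSR-Paging']")
    if paging is None:
        return None
    # Guard clause 2: a container with no anchor inside it.
    a = paging.find(".//a")
    if a is None:
        return None
    return a.get("href")

# A page with no paging div: find() returns None instead of raising,
# which is exactly the unchecked case that produced the AttributeError.
no_paging = ET.fromstring("<html><p>no results</p></html>")
assert no_paging.find(".//div[@class='CSR-Paging']") is None
assert next_link(no_paging) is None

# A page with a paging div and a next-page anchor.
paged = ET.fromstring(
    "<html><div class='CSR-Paging'><a href='/page/2'>2</a></div></html>"
)
assert next_link(paged) == "/page/2"
```

BeautifulSoup's Tag.find() behaves the same way, returning None when nothing matches, so every chained .find() call is a place where a guard clause may be needed.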