Getting garbage values when scraping

Posted 2025-02-10 20:49:31 | Characters: 1321 | Views: 1 | Comments: 0

Hi all, please check the code below, which uses bs4 to scrape a web page.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/standings/league/2019/REG'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# Narrow the HTML down to the standings table
table = soup.find('table', {'summary': 'Standings - Detailed View'})

# Collect the column headers of the table
headers = [th.text.strip() for th in table.find_all('th')]

# Create an empty DataFrame with those column headers
df = pd.DataFrame(columns=headers)

# Add each table row (skipping the header row) to the DataFrame
for row in table.find_all('tr')[1:]:
    cells = row.find_all('td')
    # The first cell holds several name variants; keep only the full club name
    first_td = cells[0].find('div', class_='d3-o-club-fullname').text.strip()
    row_data = [first_td] + [td.text.strip() for td in cells[1:]]
    df.loc[len(df)] = row_data

df.to_csv('F:/beautiful soup/tablefg.csv')

After running the code above, I get unexpected values in the output.

[Screenshot of the output]

In the screenshot, where a value should end in 0, I get 2000 instead. I don't know why this happens. The value should be 03-03-0, but the output shows 03-03-2000.
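
The thread does not identify the cause of the 2000 values. One possibility, assumed here rather than confirmed, is that the scraper writes the text correctly and the spreadsheet program used to open the CSV reinterprets a record such as 3-3-0 as a date. A minimal sketch for checking this is to read the CSV back as plain text and look at what is actually stored. The rows and the temporary file path below are hypothetical stand-ins, not data from the real page.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped standings (not real data).
rows = [
    ["NFL Team", "Home"],
    ["Team A", "3-3-0"],
]

# Write the rows to a CSV file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tablefg.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Read the file back as plain text, with no type or date inference.
with open(path, newline="", encoding="utf-8") as f:
    stored = list(csv.reader(f))

# If this prints the record unchanged, the file itself is fine and any
# date-like value is being introduced by the program displaying it.
print(stored[1][1])
```

If the raw file holds the expected text, importing the column as text in the spreadsheet program, rather than opening the file by double-clicking, would be one way to keep the value from being converted.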

Comments (1)

独享拥抱 2025-02-17 20:49:31

If you're using pandas, you don't need to make table parsing so difficult. You can simply do:

import pandas as pd

df = pd.read_html('https://www.nfl.com/standings/league/2019/REG')[0]
df.to_csv('F:/beautiful soup/tablefg.csv')
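
To see what `pd.read_html` returns without depending on the live page, here is a small sketch that parses an inline HTML table. The markup, team names, and values are hypothetical, since the real page's HTML is not shown in the thread. It also assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the standings table (not the real page markup).
SAMPLE_HTML = """
<table summary="Standings - Detailed View">
  <thead>
    <tr><th>NFL Team</th><th>W</th><th>L</th><th>T</th><th>Home</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>12</td><td>4</td><td>0</td><td>3-3-0</td></tr>
    <tr><td>Team B</td><td>7</td><td>9</td><td>0</td><td>5-3-0</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

# Numeric columns are inferred as numbers; the record column stays text.
print(df)
```

Because `read_html` returns a list, the `[0]` in the answer selects the first table on the page. If the page held several tables, a different index would be needed.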