Getting garbage values when scraping
Hi all, please check the code below, which uses bs4 to scrape a webpage.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/standings/league/2019/REG'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')

# Subset the HTML to only the table we need
table = soup.find('table', {'summary': 'Standings - Detailed View'})

# Get all the column headers of the table
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

# Create a dataframe using the column headers from the table
df = pd.DataFrame(columns=headers)

# Get all the data within the table and add it to the dataframe
for row in table.find_all('tr')[1:]:
    # The line below fixes the formatting issue with the team names
    first_td = row.find_all('td')[0].find('div', class_='d3-o-club-fullname').text.strip()
    data = row.find_all('td')[1:]
    row_data = [td.text.strip() for td in data]
    row_data.insert(0, first_td)
    length = len(df)
    df.loc[length] = row_data

df.to_csv('F:/beautiful soup/tablefg.csv')
After running the above code, I get the values shown below.
In this image, where the value should be 0, I am getting 2000. I don't know why it shows up like that. It should be 03-03-0, but the output is 03-03-2000.
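One way to narrow this down is to look at the raw text stored in the CSV file rather than at how a viewer displays it. The sketch below is only a diagnostic under assumptions: `raw_cells` is a hypothetical helper, the sample text stands in for the contents of tablefg.csv, and the column name `Home` is a guess. If the raw file still contains 3-3-0, the scraping code is probably fine and the program used to open the file may be reinterpreting the cell (spreadsheet applications commonly treat text shaped like 3-3-0 as a date).

```python
import csv
import io


def raw_cells(csv_text, column):
    """Return the unmodified string values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


# Hypothetical file contents standing in for tablefg.csv; the real column
# names and values may differ.
sample = "NFL Team,Home\nExample Team A,3-3-0\nExample Team B,5-3-0\n"

# The csv module performs no type conversion, so these are the stored strings.
print(raw_cells(sample, "Home"))
```

To run it against the real file, read the file's text with `open(path, newline='').read()` and pass it in place of `sample`.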
Comments (1)
If you're using pandas, you don't need to make table parsing so difficult. You can simply do:
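The answer's own snippet is not reproduced here. As a minimal sketch of the kind of approach it seems to point to, and assuming it means `pandas.read_html`, the example below parses a table directly into a DataFrame. The sample markup, team names, and the helper `parse_standings` are hypothetical; the real page may be structured differently, and `read_html` needs an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Hypothetical markup standing in for the downloaded page (page.text in the
# question); only the summary attribute is taken from the question's code.
SAMPLE_HTML = """
<table summary="Standings - Detailed View">
  <tr><th>NFL Team</th><th>W</th><th>L</th><th>T</th><th>Home</th></tr>
  <tr><td>Example Team A</td><td>6</td><td>10</td><td>0</td><td>3-3-0</td></tr>
  <tr><td>Example Team B</td><td>9</td><td>7</td><td>0</td><td>5-3-0</td></tr>
</table>
"""


def parse_standings(html):
    """Parse the table whose summary attribute matches the question's table."""
    tables = pd.read_html(
        io.StringIO(html),
        attrs={"summary": "Standings - Detailed View"},
    )
    # read_html returns a list of DataFrames, one per matching table.
    return tables[0]


df = parse_standings(SAMPLE_HTML)
print(df)
```

With the real page, `page.text` from the question would be passed in place of `SAMPLE_HTML`. Team-name cells on the live site may still need the cleanup the question applies by hand.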