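Exploring the NFL Draft with Python
After reading a couple of Michael Lopez's posts about the NFL draft, I decided to recreate his analysis using Python (instead of R).
First, let's import most of what we will need.
Note: You can find the GitHub repository for this post here. It contains this notebook, the data, and the conda environment I used.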
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
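Web Scraping
Before we can begin, we need some data. We will scrape the draft data from Pro-Football-Reference and then clean it up for analysis.
We will use BeautifulSoup to scrape the data and then store it in a pandas DataFrame.
To get a feel for the data, let's take a look at the 1967 draft.
The above is just a small part of the draft table found on the page. We will extract the second row of column headers and all of the information for each pick. While we're at it, we will also scrape each player's Pro-Football-Reference player page link and college stats link. That way, if we ever want to pull data from their player pages in the future, we can.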
In [2]:
# The url we will be scraping
url_1967 = "http://www.pro-football-reference.com/years/1967/draft.htm"
# get the html
html = urlopen(url_1967)
# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
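Scraping the Column Headers
The column headers we need for our DataFrame are found in the second row of column headers in the PFR table. We will scrape those and add two more column headers for the two additional player page links.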
In [3]:
# Extract the necessary values for the column headers from the table
# and store them as a list
column_headers = [th.getText() for th in
                  soup.findAll('tr', limit=2)[1].findAll('th')]
# Add the two additional column headers for the player links
column_headers.extend(["Player_NFL_Link", "Player_NCAA_Link"])
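Scraping the Data
Using the CSS selector "#drafts tr", we can easily extract the rows of data. What we are essentially doing is selecting the table row elements within the HTML element whose id is "drafts".
A really helpful tool for finding CSS selectors is SelectorGadget. It is a browser extension that lets you click on different elements of a web page and then gives you the CSS selector for the selected elements.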
In [4]:
# The data is found within the table rows of the element with id=drafts
# We want the elements from the 3rd row and on
table_rows = soup.select("#drafts tr")[2:]
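To make the selector a little more concrete, here is a minimal sketch on a tiny, made-up HTML snippet (it is not the real PFR markup). It assumes the table has the id "drafts" and two header rows, and it uses Python's built-in "html.parser" rather than lxml so the example has no extra dependency. The names sample_html, sample_soup, all_rows, and data_rows are my own, just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the PFR draft table
sample_html = """
<table id="drafts">
  <tr><th>Misc</th><th>Passing</th></tr>
  <tr><th>Rnd</th><th>Pick</th><th>Tm</th></tr>
  <tr><td>1</td><td>1</td><td>BAL</td></tr>
  <tr><td>1</td><td>2</td><td>MIN</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "#drafts tr" matches every <tr> nested inside the element with id="drafts"
all_rows = sample_soup.select("#drafts tr")

# The first two rows are header rows, so the data starts at index 2
data_rows = all_rows[2:]

# Pull the text out of the td elements of the first data row
first_row_text = [td.get_text() for td in data_rows[0].find_all("td")]
print(len(all_rows), len(data_rows), first_row_text)
```

The row from the second table is never selected, because it does not sit inside the element with id "drafts".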
Note that table_rows is a list of tag elements.
In [5]:
type(table_rows)
Out[5]:
list
In [6]:
type(table_rows[0])
Out[6]:
bs4.element.Tag
In [7]:
table_rows[0] # take a look at the first row
Out[7]:
1
1
[BAL]( http://savvastjortjoglou.com/teams/clt/1967_draft.htm "Baltimore Colts")
[Bubba Smith]( http://savvastjortjoglou.com/players/S/SmitBu00.htm)
DE
22
1976
1
2
6
62
46
111
[Michigan St.]( http://savvastjortjoglou.com/schools/michiganst/)
[College Stats]( http://www.sports-reference.com/cfb/players/bubba-smith-2.html)
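The data we want for each player can be found within the td (or table data) elements.
Below, I have created a function that extracts the data we want from table_rows. The comments should walk you through what each part of the function does.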
In [8]:
def extract_player_data(table_rows):
    """
    Extract and return the desired information from the td elements within
    the table rows.
    """
    # create the empty list to store the player data
    player_data = []
    for row in table_rows:  # for each row do the following
        # Get the text for each table data (td) element in the row
        # Some player names end with ' HOF', if they do, get the text excluding
        # those last 4 characters,
        # otherwise get all the text data from the table data
        player_list = [td.get_text()[:-4] if td.get_text().endswith(" HOF")
                       else td.get_text() for td in row.find_all("td")]
        # there are some empty table rows, which are the repeated
        # column headers in the table
        # we skip over those rows and continue the for loop
        if not player_list:
            continue
        # Extracting the player links
        # Instead of a list we create a dictionary, this way we can easily
        # match the player name with their pfr url
        # For all "a" elements in the row, get the text
        # NOTE: Same " HOF" text issue as the player_list above
        links_dict = {(link.get_text()[:-4]  # exclude the last 4 characters
                       if link.get_text().endswith(" HOF")  # if they are " HOF"
                       # else get all text, set that as the dictionary key
                       # and set the url as the value
                       else link.get_text()): link["href"]
                      for link in row.find_all("a", href=True)}
        # The data we want from the dictionary can be extracted using the
        # player's name, which returns us their pfr url, and "College Stats"
        # which returns us their college stats page
        # add the link associated to the player's pro-football-reference page,
        # or an empty string if there is no link
        player_list.append(links_dict.get(player_list[3], ""))
        # add the link for the player's college stats or an empty string
        # if there is no link
        player_list.append(links_dict.get("College Stats", ""))
        # Now append the data to list of data
        player_data.append(player_list)
    return player_data
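The two tricky bits in that function are the " HOF" suffix and the dictionary lookup with a default value, so here is a small sketch of just those parts on plain Python strings (no HTML involved). The helper strip_hof and the sample row are hypothetical and only there to illustrate the idea.

```python
def strip_hof(text):
    """Return text without a trailing ' HOF' marker, if it has one."""
    return text[:-4] if text.endswith(" HOF") else text


# Hypothetical cell texts and link texts for a single table row
cells = ["1", "4", "MIA", "Bob Griese HOF", "QB"]
links = {"Bob Griese HOF": "/players/G/GrieBo00.htm"}

# Clean both the cell texts and the dictionary keys the same way,
# so that the player name can be used to look up the link
cleaned_cells = [strip_hof(cell) for cell in cells]
cleaned_links = {strip_hof(name): url for name, url in links.items()}

# The player name sits at index 3; .get() falls back to "" when
# there is no matching link
player_link = cleaned_links.get(cleaned_cells[3], "")
college_link = cleaned_links.get("College Stats", "")
print(cleaned_cells[3], repr(player_link), repr(college_link))
```

Because the same cleaning is applied to both sides, the lookup still works for Hall of Fame players, and a missing link simply becomes an empty string instead of raising a KeyError.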
Now we can create a DataFrame using the data from the 1967 draft.
In [9]:
# extract the data we want
data = extract_player_data(table_rows)
# and then store it in a DataFrame
df_1967 = pd.DataFrame(data, columns=column_headers)
In [10]:
df_1967.head()
Out[10]:
 | Rnd | Pick | Tm | | Pos | Age | To | AP1 | PB | St | ... | TD | Rec | Yds | TD | Int | Sk | College/Univ | | Player_NFL_Link | Player_NCAA_Link
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1 | BAL | Bubba Smith | DE | 22 | 1976 | 1 | 2 | 6 | ... | | | | | | | Michigan St. | College Stats | /players/S/SmitBu00.htm | http://www.sports-reference.com/cfb/players/bu ...
1 | 1 | 2 | MIN | Clint Jones | RB | 22 | 1973 | 0 | 0 | 2 | ... | 20 | 38 | 431 | 0 | | | Michigan St. | College Stats | /players/J/JoneCl00.htm | http://www.sports-reference.com/cfb/players/cl ...
2 | 1 | 3 | SFO | Steve Spurrier | QB | 22 | 1976 | 0 | 0 | 6 | ... | 2 | | | | | | Florida | College Stats | /players/S/SpurSt00.htm | http://www.sports-reference.com/cfb/players/st ...
3 | 1 | 4 | MIA | Bob Griese | QB | 22 | 1980 | 2 | 8 | 12 | ... | 7 | | | | | | Purdue | College Stats | /players/G/GrieBo00.htm | http://www.sports-reference.com/cfb/players/bo ...
4 | 1 | 5 | HOU | George Webster | LB | 21 | 1976 | 3 | 3 | 6 | ... | | | | | 5 | | Michigan St. | College Stats | /players/W/WebsGe00.htm | http://www.sports-reference.com/cfb/players/ge ...
5 rows × 30 columns
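Scraping the Data for All Drafts Since 1967
Scraping the data for all drafts since 1967 follows essentially the same process as above, just repeated for each draft year using a for loop.
As we loop over the years, we will create a DataFrame for each draft and append it to a large list of DataFrames that contains all the drafts. We will also keep a separate list that holds any errors along with the URL associated with each error. This will let us know whether there are any issues with our scraper and which URL caused the error. We will also add an extra column for tackles (Tkl). Tackles only show up for drafts after the 1993 season, so it is a column we need to insert into the DataFrames we create for the drafts from 1967 to 1993.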
In [11]:
# Create an empty list that will contain all the dataframes
# (one dataframe for each draft)
draft_dfs_list = []
# a list to store any errors that may come up while scraping
errors_list = []
In [12]:
# The url template that we pass in the draft year into
url_template = "http://www.pro-football-reference.com/years/{year}/draft.htm"
# for each year from 1967 to (and including) 2016
for year in range(1967, 2017):
    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the draft url
        url = url_template.format(year=year)
        # get the html
        html = urlopen(url)
        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml")
        # get the column headers
        column_headers = [th.getText() for th in
                          soup.findAll('tr', limit=2)[1].findAll('th')]
        column_headers.extend(["Player_NFL_Link", "Player_NCAA_Link"])
        # select the data from the table using the '#drafts tr' CSS selector
        table_rows = soup.select("#drafts tr")[2:]
        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)
        # create the dataframe for the current year's draft
        year_df = pd.DataFrame(player_data, columns=column_headers)
        # if it is a draft from before 1994 then add a Tkl column at the
        # 24th position
        if year < 1994:
            year_df.insert(24, "Tkl", "")
        # add the year of the draft to the dataframe
        year_df.insert(0, "Draft_Yr", year)
        # append the current dataframe to the list of dataframes
        draft_dfs_list.append(year_df)
    except Exception as e:
        # Store the url and the error it causes in a list
        error = [url, e]
        # then append it to the list of errors
        errors_list.append(error)
In [13]:
len(errors_list)
Out[13]:
0
In [14]:
errors_list
Out[14]:
[]
We didn't get any errors, so that's good.
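If you want to see the error list actually catch something without hitting the website, here is a rough sketch of the same try/except pattern with the network call swapped out for a made-up function. fake_scrape is purely hypothetical and fails on purpose for one year; only the URL template comes from the code above.

```python
# Hypothetical stand-in for "download and parse one draft page";
# it fails on purpose for one year so we can see the error handling.
def fake_scrape(url):
    if "1969" in url:
        raise ValueError("simulated scraping problem")
    return "parsed data for " + url


url_template = "http://www.pro-football-reference.com/years/{year}/draft.htm"
results = []
errors = []

for year in range(1967, 1971):
    # build the url outside the try block so that it is always defined
    # by the time the except clause needs it
    url = url_template.format(year=year)
    try:
        results.append(fake_scrape(url))
    except Exception as e:
        # keep the url together with the error it caused
        errors.append([url, e])

for bad_url, err in errors:
    print(bad_url, "->", repr(err))
```

The loop keeps going after the failure, and afterwards the errors list tells us exactly which URL to go back and inspect.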
Now we can concatenate all the DataFrames we scraped and create one large DataFrame containing all the drafts.
In [15]:
# store all drafts in one DataFrame
draft_df = pd.concat(draft_dfs_list, ignore_index=True)
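The reason for inserting the empty Tkl column before concatenating is easier to see on a toy example. The sketch below uses two tiny, made-up frames (the player names and numbers are placeholders, not real data): an "old" one without a Tkl column and a "new" one with it.

```python
import pandas as pd

# Made-up frames: the older draft has no tackles column
old_draft = pd.DataFrame([["Player A", "1"]], columns=["Player", "Sk"])
new_draft = pd.DataFrame([["Player B", "10", "2"]],
                         columns=["Player", "Tkl", "Sk"])

# Insert an empty "Tkl" column at the position it has in the newer frame,
# so both frames end up with the same columns in the same order
old_draft.insert(1, "Tkl", "")

# Tag each frame with its draft year as the first column
old_draft.insert(0, "Draft_Yr", 1967)
new_draft.insert(0, "Draft_Yr", 1994)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([old_draft, new_draft], ignore_index=True)
print(combined)
```

Inserting by position matters here because the real headers contain duplicate names, so we later rename the columns by position as well.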
In [16]:
# Take a look at the first few rows
draft_df.head()
Out[16]:
 | Draft_Yr | Rnd | Pick | Tm | | Pos | Age | To | AP1 | PB | ... | Rec | Yds | TD | Tkl | Int | Sk | College/Univ | | Player_NFL_Link | Player_NCAA_Link
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1967 | 1 | 1 | BAL | Bubba Smith | DE | 22 | 1976 | 1 | 2 | ... | | | | | | | Michigan St. | College Stats | /players/S/SmitBu00.htm | http://www.sports-reference.com/cfb/players/bu ...
1 | 1967 | 1 | 2 | MIN | Clint Jones | RB | 22 | 1973 | 0 | 0 | ... | 38 | 431 | 0 | | | | Michigan St. | College Stats | /players/J/JoneCl00.htm | http://www.sports-reference.com/cfb/players/cl ...
2 | 1967 | 1 | 3 | SFO | Steve Spurrier | QB | 22 | 1976 | 0 | 0 | ... | | | | | | | Florida | College Stats | /players/S/SpurSt00.htm | http://www.sports-reference.com/cfb/players/st ...
3 | 1967 | 1 | 4 | MIA | Bob Griese | QB | 22 | 1980 | 2 | 8 | ... | | | | | | | Purdue | College Stats | /players/G/GrieBo00.htm | http://www.sports-reference.com/cfb/players/bo ...
4 | 1967 | 1 | 5 | HOU | George Webster | LB | 21 | 1976 | 3 | 3 | ... | | | | | 5 | | Michigan St. | College Stats | /players/W/WebsGe00.htm | http://www.sports-reference.com/cfb/players/ge ...
5 rows × 32 columns
Since there are some duplicate column headers, and some are even empty strings, we should edit the columns a bit.
In [17]:
# get the current column headers from the dataframe as a list
column_headers = draft_df.columns.tolist()
# The 5th column header is an empty string, but represents player names
column_headers[4] = "Player"
# Prepend "Rush_" for the columns that represent rushing stats
column_headers[19:22] = ["Rush_" + col for col in column_headers[19:22]]
# Prepend "Rec_" for the columns that represent receiving stats
column_headers[23:25] = ["Rec_" + col for col in column_headers[23:25]]
# Properly label the defensive int column as "Def_Int"
column_headers[-6] = "Def_Int"
# Just use "College" as the column header representing the player's college or univ
column_headers[-4] = "College"
# Take a look at the updated column headers
column_headers
Out[17]:
['Draft_Yr',
'Rnd',
'Pick',
'Tm',
'Player',
'Pos',
'Age',
'To',
'AP1',
'PB',
'St',
'CarAV',
'DrAV',
'G',
'Cmp',
'Att',
'Yds',
'TD',
'Int',
'Rush_Att',
'Rush_Yds',
'Rush_TD',
'Rec',
'Rec_Yds',
'Rec_TD',
'Tkl',
'Def_Int',
'Sk',
'College',
'',
'Player_NFL_Link',
'Player_NCAA_Link']
In [18]:
# Now assign edited columns to the DataFrame
draft_df.columns = column_headers
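The renaming above relies on slice assignment, which swaps out a run of list elements in place. Here is a tiny sketch of the same trick on a shortened, made-up header list, so the index numbers are different from the real ones.

```python
# Shortened, made-up header list with the same kind of duplicates
headers = ["Pick", "", "Att", "Yds", "TD",   # passing stats
           "Att", "Yds", "TD",               # rushing stats
           "Rec", "Yds", "TD"]               # receiving stats

# The empty header holds the player names
headers[1] = "Player"

# Slice assignment replaces positions 5, 6 and 7 in one go
headers[5:8] = ["Rush_" + col for col in headers[5:8]]

# "Rec" is already unambiguous, so only prefix the two columns after it
headers[9:11] = ["Rec_" + col for col in headers[9:11]]

print(headers)
# After the renaming every header should be unique
print(len(set(headers)) == len(headers))
```

Checking that the set of headers has the same length as the list is a cheap way to confirm that no duplicates are left.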
Now that we have fixed up the necessary columns, let's write the raw data out to a CSV file.
In [19]:
# Write out the raw draft data to the raw_data folder in the data folder
draft_df.to_csv("data/raw_data/pfr_nfl_draft_data_RAW.csv", index=False)
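Cleaning the Data
Now that we have the raw draft data, we need to clean it up a bit in order to do some of the data exploration we want.
Create a Player ID/Links DataFrame
First, let's create a separate DataFrame that contains the player names, their player page links, and their player IDs on Pro-Football-Reference. This way we can have a separate CSV file that contains just the information needed to extract individual player data from Pro-Football-Reference sometime in the future.
To extract the Pro-Football-Reference player ID from the player link, we will need to use a regular expression. A regular expression is a sequence of characters used to match some pattern in a body of text. The regular expression we can use to match the player link and extract the ID is the following: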
/.*/.*/(.*)\.
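The regex above essentially says: match a string with the following pattern:
- One '/'.
- Followed by 0 or more characters (this is what the '.*' characters represent).
- Followed by another '/' (the second '/' character).
- Followed by 0 or more characters (again, the '.*' characters).
- Followed by another (third) '/'.
- Followed by a grouping of 0 or more characters (the '(.*)' characters).
  - This is the key part of our regex. The '()' creates a grouping around the characters we want to extract. Since the player IDs are located between the third '/' and the '.', we use '(.*)' to extract all the characters found in that part of our string.
- Followed by a '.', the character after the player ID.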
We can extract the IDs by passing the above regex into the pandas extract method.
In [20]:
# extract the player id from the player links
# expand=False returns the IDs as a pandas Series
# (a raw string keeps the "\." escape intact for the regex engine)
player_ids = draft_df.Player_NFL_Link.str.extract(r"/.*/.*/(.*)\.",
                                                  expand=False)
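To check the pattern on a single link without pandas, here is a quick sketch using Python's built-in re module. The helper name extract_player_id is my own; the link is the relative Bubba Smith link from the table above.

```python
import re

# Same pattern as above, written as a raw string so that "\." reaches
# the regex engine unchanged
PLAYER_ID_PATTERN = re.compile(r"/.*/.*/(.*)\.")


def extract_player_id(link):
    """Return the player ID found in a player link, or None if no match."""
    match = PLAYER_ID_PATTERN.search(link)
    return match.group(1) if match else None


# A relative link like the ones we scraped
print(extract_player_id("/players/S/SmitBu00.htm"))  # SmitBu00

# Players without a player page have an empty link, so nothing matches
print(extract_player_id(""))  # None
```

The second case is why Player_ID ends up with missing values for players who never got a player page.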
In [21]:
# add a Player_ID column to our draft_df
draft_df["Player_ID"] = player_ids
In [22]:
# add the beginning of the pfr url to the player link column
pfr_url = "http://www.pro-football-reference.com"
draft_df.Player_NFL_Link = pfr_url + draft_df.Player_NFL_Link
Now we can save a DataFrame that contains just the player names, IDs, and links.
In [23]:
# Get the Player name, IDs, and links
player_id_df = draft_df.loc[:, ["Player", "Player_ID", "Player_NFL_Link",
                                "Player_NCAA_Link"]]
# Save them to a CSV file
player_id_df.to_csv("data/clean_data/pfr_player_ids_and_links.csv",
                    index=False)
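Cleaning Up the Rest of the Draft Data
Now that we are done with the player IDs, let's get back to dealing with the draft data.
First, let's drop some unnecessary columns.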
In [24]:
# drop the player links and the column labeled by an empty string
draft_df.drop(draft_df.columns[-4:-1], axis=1, inplace=True)
The main issue left with the rest of the draft data is converting everything to the correct data type.
In [25]:
draft_df.info()
Int64Index: 15845 entries, 0 to 15844
Data columns (total 30 columns):
Draft_Yr 15845 non-null int64
Rnd 15845 non-null object
Pick 15845 non-null object
Tm 15845 non-null object
Player 15845 non-null object
Pos 15845 non-null object
Age 15845 non-null object
To 15845 non-null object
AP1 15845 non-null object
PB 15845 non-null object
St 15845 non-null object
CarAV 15845 non-null object
DrAV 15845 non-null object
G 15845 non-null object
Cmp 15845 non-null object
Att 15845 non-null object
Yds 15845 non-null object
TD 15845 non-null object
Int 15845 non-null object
Rush_Att 15845 non-null object
Rush_Yds 15845 non-null object
Rush_TD 15845 non-null object
Rec 15845 non-null object
Rec_Yds 15845 non-null object
Rec_TD 15845 non-null object
Tkl 15845 non-null object
Def_Int 15845 non-null object
Sk 15845 non-null object
College 15845 non-null object
Player_ID 11416 non-null object
dtypes: int64(1), object(29)
memory usage: 3.7+ MB
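From the above we can see that a lot of the player data isn't numeric when it should be. To convert all the columns to their proper numeric type, we can apply the to_numeric function to the whole DataFrame. Because it is impossible to convert some of the columns (e.g. Player, Tm, etc.) to a numeric type (since they aren't numbers), we need to set the errors parameter to "ignore" to avoid raising any errors.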
In [26]:
# convert the data to proper numeric types
draft_df = draft_df.apply(pd.to_numeric, errors="ignore")
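One caveat: as far as I know, newer pandas releases (2.2 and later) warn that errors="ignore" is deprecated in to_numeric. If you run into that, a possible workaround is to try the conversion column by column and leave a column alone when it fails. The sketch below does that on a small sample frame; the helper to_numeric_where_possible is hypothetical and not part of pandas.

```python
import pandas as pd


def to_numeric_where_possible(df):
    """Convert each column to a numeric dtype, skipping columns that fail."""
    converted = df.copy()
    for col in converted.columns:
        try:
            converted[col] = pd.to_numeric(converted[col])
        except (ValueError, TypeError):
            # not a numeric column (e.g. player names), keep it as it is
            pass
    return converted


# Small sample frame where everything starts out as text
sample = pd.DataFrame({"Player": ["Bubba Smith", "Clint Jones"],
                       "Pick": ["1", "2"],
                       "Age": ["22", "22"]})

converted = to_numeric_where_possible(sample)
print(converted.dtypes)
```

The text columns come through untouched, while Pick and Age become integer columns.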
In [27]:
draft_df.info()
Int64Index: 15845 entries, 0 to 15844
Data columns (total 30 columns):
Draft_Yr 15845 non-null int64
Rnd 15845 non-null int64
Pick 15845 non-null int64
Tm 15845 non-null object
Player 15845 non-null object
Pos 15845 non-null object
Age 11297 non-null float64
To 10995 non-null float64
AP1 15845 non-null int64
PB 15845 non-null int64
St 15845 non-null int64
CarAV 10995 non-null float64
DrAV 9571 non-null float64
G 10962 non-null float64
Cmp 1033 non-null float64
Att 1033 non-null float64
Yds 1033 non-null float64
TD 1033 non-null float64
Int 1033 non-null float64
Rush_Att 2776 non-null float64
Rush_Yds 2776 non-null float64
Rush_TD 2776 non-null float64
Rec 3395 non-null float64
Rec_Yds 3395 non-null float64
Rec_TD 3395 non-null float64
Tkl 3644 non-null float64
Def_Int 2590 non-null float64
Sk 2670 non-null float64
College 15845 non-null object
Player_ID 11416 non-null object
dtypes: float64(19), int64(6), object(5)
memory usage: 3.7+ MB
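We are not done yet. A lot of the numeric columns are missing data because players didn't accumulate any of those stats. For example, some players never scored a TD, or never even played a game. Let's select the columns with numeric data and then replace the NaNs (the current value that represents the missing data) with 0s, as that is a more appropriate value.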
In [28]:
# Get the column names for the numeric columns
num_cols = draft_df.columns[draft_df.dtypes != object]
# Replace all NaNs with 0
draft_df.loc[:, num_cols] = draft_df.loc[:, num_cols].fillna(0)
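Here is the same idea as a small sketch on a made-up frame, this time picking the numeric columns with select_dtypes, which is just another way of expressing the dtype filter above. The column names mirror ours, but the values are placeholders.

```python
import pandas as pd

# Made-up frame: one text column with a missing ID, one numeric column
# with a missing stat
sample = pd.DataFrame({"Player": ["Player A", "Player B"],
                       "Player_ID": ["PlayA00", None],
                       "TD": [3.0, float("nan")]})

# Pick out only the numeric columns
num_cols = sample.select_dtypes(include="number").columns

# Fill the missing stats with 0, but leave the text columns untouched
sample[num_cols] = sample[num_cols].fillna(0)
print(sample)
```

Restricting fillna to the numeric columns is what keeps the missing Player_ID values from being turned into the number 0.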
In [29]:
# Everything is filled, except for Player_ID, which is fine for now
draft_df.info()
Int64Index: 15845 entries, 0 to 15844
Data columns (total 30 columns):
Draft_Yr 15845 non-null int64
Rnd 15845 non-null int64
Pick 15845 non-null int64
Tm 15845 non-null object
Player 15845 non-null object
Pos 15845 non-null object
Age 15845 non-null float64
To 15845 non-null float64
AP1 15845 non-null int64
PB 15845 non-null int64
St 15845 non-null int64
CarAV 15845 non-null float64
DrAV 15845 non-null float64
G 15845 non-null float64
Cmp 15845 non-null float64
Att 15845 non-null float64
Yds 15845 non-null float64
TD 15845 non-null float64
Int 15845 non-null float64
Rush_Att 15845 non-null float64
Rush_Yds 15845 non-null float64
Rush_TD 15845 non-null float64
Rec 15845 non-null float64
Rec_Yds 15845 non-null float64
Rec_TD 15845 non-null float64
Tkl 15845 non-null float64
Def_Int 15845 non-null float64
Sk 15845 non-null float64
College 15845 non-null object
Player_ID 11416 non-null object
dtypes: float64(19), int64(6), object(5)
memory usage: 3.7+ MB
We are finally done cleaning the data, and now we can save it to a CSV file.
In [30]:
draft_df.to_csv("data/clean_data/pfr_nfl_draft_data_CLEAN.csv", index=False)
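Exploring the NFL Draft
Now that we are done getting and cleaning the data we want, we can finally have some fun. First, let's keep only the draft data up to and including the 2010 draft, since players who were drafted more recently haven't accumulated enough data to have a properly representative career Approximate Value (or cAV).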
In [31]:
# get data for drafts from 1967 to 2010
draft_df_2010 = draft_df.loc[draft_df.Draft_Yr <= 2010, :]
In [32]:
draft_df_2010.tail() # we see that the last draft is 2010
Out[32]:
 | Draft_Yr | Rnd | Pick | Tm | Player | Pos | Age | To | AP1 | PB | ... | Rush_Yds | Rush_TD | Rec | Rec_Yds | Rec_TD | Tkl | Def_Int | Sk | College | Player_ID
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
14314 | 2010 | 7 | 251 | OAK | Stevie Brown | DB | 23.0 | 2014.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 98.0 | 8.0 | 1.0 | Michigan | BrowSt99
14315 | 2010 | 7 | 252 | MIA | Austin Spitler | LB | 23.0 | 2013.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 0.0 | 0.0 | Ohio St. | SpitAu99
14316 | 2010 | 7 | 253 | TAM | Erik Lorig | DE | 23.0 | 2014.0 | 0 | 0 | ... | 4.0 | 0.0 | 39.0 | 220.0 | 2.0 | 3.0 | 0.0 | 0.0 | Stanford | LoriEr99
14317 | 2010 | 7 | 254 | STL | Josh Hull | LB | 23.0 | 2013.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 0.0 | 0.0 | Penn St. | HullJo99
14318 | 2010 | 7 | 255 | DET | Tim Toone | WR | 25.0 | 2012.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Weber St. | ToonTi00
5 rows × 30 columns
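The Career Approximate Value Distribution
Using seaborn's distplot function, we can quickly see what the cAV distribution looks like, as both a histogram and a kernel density estimate.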
In [33]:
# set some plotting styles
from matplotlib import rcParams
# set the font scaling and the plot sizes
sns.set(font_scale=1.65)
rcParams["figure.figsize"] = 12,9
In [34]:
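# Use distplot to view the distribution
# NOTE: distplot is deprecated in seaborn >= 0.11; histplot(..., kde=True)
# is the closest replacement there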
sns.distplot(draft_df_2010.CarAV)
plt.title("Distribution of Career Approximate Value")
plt.xlim(-5,150)
plt.show()
We can also view the distribution by position with the boxplot function.
In [35]:
sns.boxplot(x="Pos", y="CarAV", data=draft_df_2010)
plt.title("Distribution of Career Approximate Value by Position (1967-2010)")
plt.show()
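From the two plots above, we can see that most players don't end up doing much in their NFL careers, as most players hover around the 0-10 cAV range.
There are also some positions that have a cAV of 0 for the whole distribution, or a very low (and narrow) cAV distribution. As we can see from the value counts below, that is probably because very few players carry those position labels.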
In [36]:
# Look at the counts for each position
draft_df_2010.Pos.value_counts()
Out[36]:
DB 2456
LB 1910
RB 1686
WR 1636
DE 1130
T 1091
G 959
DT 889
TE 802
QB 667
C 425
K 187
P 150
NT 127
FB 84
FL 63
E 29
HB 23
KR 3
WB 2
Name: Pos, dtype: int64
Let's drop those positions and merge the "HB" players with the "RB" players.
In [37]:
# drop players from the following positions [FL, E, WB, KR]
drop_idx = ~ draft_df_2010.Pos.isin(["FL", "E", "WB", "KR"])
draft_df_2010 = draft_df_2010.loc[drop_idx, :]
In [38]:
# Now replace HB label with RB label
draft_df_2010.loc[draft_df_2010.Pos == "HB", "Pos"] = "RB"
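For anyone not used to boolean masks, here is a small sketch of those two steps on a made-up frame. The cAV numbers are placeholders. I also take an explicit .copy() after filtering, which is my own addition; it should help avoid pandas' SettingWithCopyWarning when we assign into the filtered frame afterwards.

```python
import pandas as pd

# Made-up players, one per position label
players = pd.DataFrame({"Pos": ["QB", "HB", "KR", "RB", "FL"],
                        "CarAV": [50, 12, 1, 30, 4]})

# isin() marks the rows we want to drop; ~ flips the mask so that
# True means "keep this row"
keep_mask = ~players.Pos.isin(["FL", "E", "WB", "KR"])

# .copy() makes the filtered frame independent of the original one
players = players.loc[keep_mask, :].copy()

# Relabel the half backs as running backs
players.loc[players.Pos == "HB", "Pos"] = "RB"
print(players)
```

After these steps only the QB row and two RB rows remain.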
Let's take another look at the positional distributions.
In [39]:
sns.boxplot(x="Pos", y="CarAV", data=draft_df_2010)
plt.title("Distribution of Career Approximate Value by Position (1967-2010)")
plt.show()
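Fitting a Draft Curve
Now we can fit a curve to look at the cAV at each pick. We will fit the curve using local regression, which "travels" along the data, fitting a curve to small chunks of the data at a time.
seaborn lets us plot a Lowess curve pretty easily by using regplot and setting the lowess parameter to True.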
In [40]:
# plot LOWESS curve
# set line color to be black, and scatter color to cyan
sns.regplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True,
            line_kws={"color": "black"},
            scatter_kws={"color": sns.color_palette()[5], "alpha": 0.5})
plt.title("Career Approximate Value by Pick")
plt.xlim(-5, 500)
plt.ylim(-5, 200)
plt.show()
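To give a feel for what "fitting a curve to small chunks of the data" means, here is a deliberately simplified local regression written with NumPy. It is only a sketch of the idea and not what seaborn runs under the hood (seaborn hands the work to statsmodels, whose lowess also adds robustness iterations). The function name simple_local_regression, the frac default, and the sample data are all my own assumptions.

```python
import numpy as np


def simple_local_regression(x, y, frac=0.5):
    """Fit a weighted straight line around each point and return the
    fitted value at that point (no robustness iterations)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # number of neighbours that make up each local "chunk"
    k = min(n, max(3, int(np.ceil(frac * n))))
    design = np.column_stack([np.ones(n), x])
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        # bandwidth: distance to the k-th nearest point
        bandwidth = max(np.sort(dist)[k - 1], 1e-12)
        # tricube weights: close points count a lot, far points not at all
        weights = np.clip(1 - (dist / bandwidth) ** 3, 0, None) ** 3
        sqrt_w = np.sqrt(weights)
        # weighted least squares for a straight line through the chunk
        coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w,
                                   rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted


# Made-up "value by pick" data that decays steadily
picks = np.arange(1, 21, dtype=float)
values = 100.0 / picks
curve = simple_local_regression(picks, values, frac=0.4)
print(np.round(curve, 1))
```

Each fitted value only depends on the nearby picks, which is why the resulting curve can bend to follow the data instead of being forced into a single straight line.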
We can also fit a Lowess curve for each position by using lmplot and setting hue to "Pos".
In [41]:
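# Fit a LOWESS curve for each position
# NOTE: in seaborn >= 0.9 the `size` parameter of lmplot is named `height`
# (this applies here and in the lmplot call further below)
sns.lmplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True, hue="Pos",
           size=10, scatter=False)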
plt.title("Career Approximate Value by Pick and Position")
plt.xlim(-5, 500)
plt.ylim(-1, 60)
plt.show()
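The plot above is a bit too cluttered because of all the lines. We can actually separate the curves out and plot the position curves individually. To do this, instead of setting hue to "Pos", we can set col to "Pos". To organize all the plots into a 5x3 grid, we have to set col_wrap to 5.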
In [44]:
lm = sns.lmplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True, col="Pos",
                col_wrap=5, size=4, line_kws={"color": "black"},
                scatter_kws={"color": sns.color_palette()[5], "alpha": 0.7})
# add title to the plot (which is a FacetGrid)
# https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot
plt.subplots_adjust(top=0.9)
lm.fig.suptitle("Career Approximate Value by Pick and Position",
                fontsize=30)
plt.xlim(-5, 500)
plt.ylim(-1, 100)
plt.show()
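As a quick sanity check on the grid shape, the arithmetic can be sketched in a few lines of plain Python. The position labels are the ones from the value counts earlier; the variable names are just for illustration.

```python
import math

# Position labels from the value counts shown earlier
positions = ["DB", "LB", "RB", "WR", "DE", "T", "G", "DT", "TE", "QB",
             "C", "K", "P", "NT", "FB", "FL", "E", "HB", "KR", "WB"]

# The labels we dropped, plus HB being folded into RB
dropped = {"FL", "E", "WB", "KR"}
remaining = {("RB" if pos == "HB" else pos)
             for pos in positions if pos not in dropped}

# col_wrap is the number of panels per row, so the number of rows is
# the panel count divided by col_wrap, rounded up
col_wrap = 5
n_rows = math.ceil(len(remaining) / col_wrap)
print(len(remaining), "panels ->", n_rows, "rows of", col_wrap)
```

With 15 positions left and 5 panels per row, the panels fill exactly 3 rows.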
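Additional Resources
Here are some additional resources that cover this kind of material:
- Check out Michael Lopez's recent posts on the NFL draft and on constructing and comparing draft curves for each major sport.
- If you are just getting started with Python, I recommend reading Automate the Boring Stuff with Python. It has chapters that cover web scraping and regular expressions.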