Exploring the NFL Draft with Python

Published 2025-01-17 12:31:12

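After reading a couple of Michael Lopez's posts about the NFL draft, I decided to recreate his analysis using Python (instead of R).

First, let's import most of what we'll need.

Note: You can find the GitHub repository for this post here. It includes this notebook, the data, and the conda environment I used.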

In [1]:


    %matplotlib inline

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

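Web Scraping

Before we can start, we need some data. We'll scrape the draft data from Pro-Football-Reference and then clean it up for analysis.

We'll use BeautifulSoup to scrape the data and then store it in a pandas DataFrame.

To get a feel for the data, let's take a look at the 1967 draft.

The above is just a small part of the draft table found on the page. We'll extract the second row of column headers and all of the information for each pick. While we're at it, we'll also scrape each player's Pro-Football-Reference player page link and college stats link. That way, if we ever want to pull data from the player pages later, we can.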

In [2]:


    # The url we will be scraping
    url_1967 = "http://www.pro-football-reference.com/years/1967/draft.htm"

    # get the html
    html = urlopen(url_1967)

    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")

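Scraping the Column Headers

The column headers we need for our DataFrame are in the second row of column headers in the PFR table. We'll scrape those and add two more column headers for the two additional player page links.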

In [3]:


    # Extract the necessary values for the column headers from the table
    # and store them as a list
    column_headers = [th.getText() for th in 
                      soup.findAll('tr', limit=2)[1].findAll('th')]

    # Add the two additional column headers for the player links
    column_headers.extend(["Player_NFL_Link", "Player_NCAA_Link"])
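To see what that list comprehension does without hitting the network, here is a minimal sketch that applies the same idea to a tiny, made-up HTML table. The markup and header names are hypothetical stand-ins for the real PFR page, and the sketch assumes Python's built-in "html.parser" is an acceptable substitute for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a table with two header rows (not the real PFR markup)
sample_html = """
<table id="drafts">
  <tr><th></th><th></th><th colspan="2">Passing</th></tr>
  <tr><th>Rnd</th><th>Pick</th><th>Cmp</th><th>Att</th></tr>
  <tr><td>1</td><td>1</td><td>0</td><td>0</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Take the first two rows, keep the second one, and read the text of each th
second_header_row = soup.find_all("tr", limit=2)[1]
column_headers = [th.get_text() for th in second_header_row.find_all("th")]

# Add the two extra headers for the player links
column_headers.extend(["Player_NFL_Link", "Player_NCAA_Link"])

print(column_headers)
```

The first header row only holds grouping labels such as "Passing", which is why the second row is the one worth keeping.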

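Scraping the Data

Using the CSS selector "#drafts tr", we can easily pull out the rows of data. What we're essentially doing is selecting the table row elements inside the HTML element whose id is "drafts".

When it comes to finding CSS selectors, a very useful tool is SelectorGadget. It's a browser extension that lets you click on different elements of a web page and then gives you the CSS selector for the selected elements.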

In [4]:


    # The data is found within the table rows of the element with id=drafts
    # We want the elements from the 3rd row and on
    table_rows = soup.select("#drafts tr")[2:]
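As a rough illustration of how an id-based CSS selector behaves, the sketch below runs select on a small, made-up document containing two tables. The HTML is hypothetical. Only the table with the matching id contributes rows, and slicing with [2:] skips the two header rows.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table with id="drafts" and one unrelated table
sample_html = """
<table id="drafts">
  <tr><th colspan="2">Header row 1</th></tr>
  <tr><th>Rnd</th><th>Player</th></tr>
  <tr><td>1</td><td>Player A</td></tr>
  <tr><td>1</td><td>Player B</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "#drafts tr" matches every tr nested inside the element whose id is "drafts"
all_rows = soup.select("#drafts tr")

# Skip the two header rows and keep only the data rows
data_rows = all_rows[2:]
first_cells = [[td.get_text() for td in row.find_all("td")] for row in data_rows]

print(len(all_rows), first_cells)
```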

Note that table_rows is a list of Tag elements.

In [5]:


    type(table_rows)

Out[5]:


    list

In [6]:


    type(table_rows[0])

Out[6]:


    bs4.element.Tag

In [7]:


    table_rows[0] # take a look at the first row

Out[7]:

    (rendered contents of the row's td cells; link targets shown in brackets)
    1 | 1 | BAL [/teams/clt/1967_draft.htm, "Baltimore Colts"] | Bubba Smith [/players/S/SmitBu00.htm]
    DE | 22 | 1976 | 1 | 2 | 6 | 62 | 46 | 111 | (remaining stat cells are empty)
    Michigan St. [/schools/michiganst/] | College Stats [http://www.sports-reference.com/cfb/players/bubba-smith-2.html]

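The data we want for each player can be found within the td (table data) elements.

Below, I've created a function that extracts the data we want from table_rows. The comments walk you through what each part of the function does.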

In [8]:


    def extract_player_data(table_rows):
        """
        Extract and return the desired information from the td elements within
        the table rows.
        """
        # create the empty list to store the player data
        player_data = []

        for row in table_rows:  # for each row do the following

            # Get the text for each table data (td) element in the row
            # Some player names end with ' HOF', if they do, get the text excluding
            # those last 4 characters,
            # otherwise get all the text data from the table data
            player_list = [td.get_text()[:-4] if td.get_text().endswith(" HOF") 
                           else td.get_text() for td in row.find_all("td")]

            # there are some empty table rows, which are the repeated 
            # column headers in the table
            # we skip over those rows and continue the for loop
            if not player_list:
                continue

            # Extracting the player links
            # Instead of a list we create a dictionary, this way we can easily
            # match the player name with their pfr url
            # For all "a" elements in the row, get the text
            # NOTE: Same " HOF" text issue as the player_list above
            links_dict = {(link.get_text()[:-4]   # exclude the last 4 characters
                           if link.get_text().endswith(" HOF")  # if they are " HOF"
                           # else get all text, set that as the dictionary key
                           # and set the url as the value
                           else link.get_text()) : link["href"] 
                           for link in row.find_all("a", href=True)}

            # The data we want from the dictionary can be extracted using the
            # player's name, which returns us their pfr url, and "College Stats"
            # which returns us their college stats page

            # add the link associated to the player's pro-football-reference page, 
            # or an empty string if there is no link
            player_list.append(links_dict.get(player_list[3], ""))

            # add the link for the player's college stats or an empty string
            # if there is no link
            player_list.append(links_dict.get("College Stats", ""))

            # Now append the data to list of data
            player_data.append(player_list)

        return player_data
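The two trickiest parts of the function are the " HOF" suffix handling and the name-to-link dictionary. Here is a minimal sketch of both on a single, made-up table row. The row markup, the team link, and the example.com URL are hypothetical. The strip_hof helper is a name introduced here for illustration and is not part of the original function.

```python
from bs4 import BeautifulSoup

def strip_hof(text):
    """Drop a trailing ' HOF' marker, if present (hypothetical helper)."""
    return text[:-4] if text.endswith(" HOF") else text

# Hypothetical row: the player cell carries a " HOF" suffix after the link
row_html = """
<table><tr>
  <td>1</td><td>4</td><td><a href="/teams/mia/1967_draft.htm">MIA</a></td>
  <td><a href="/players/G/GrieBo00.htm">Bob Griese</a> HOF</td>
  <td><a href="http://example.com/college-stats">College Stats</a></td>
</tr></table>
"""

row = BeautifulSoup(row_html, "html.parser").find("tr")

# Cell text, with the " HOF" suffix removed where it occurs
player_list = [strip_hof(td.get_text()) for td in row.find_all("td")]

# Map each link's text to its href so links can be looked up by name
links_dict = {strip_hof(a.get_text()): a["href"]
              for a in row.find_all("a", href=True)}

# The player name sits at index 3; fall back to "" when a link is missing
player_list.append(links_dict.get(player_list[3], ""))
player_list.append(links_dict.get("College Stats", ""))

print(player_list)
```

Looking links up by name rather than by position means a row with a missing college link still ends up with the right number of fields.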

Now we can create a DataFrame using the data from the 1967 draft.

In [9]:


    # extract the data we want
    data = extract_player_data(table_rows)

    # and then store it in a DataFrame
    df_1967 = pd.DataFrame(data, columns=column_headers)

In [10]:


    df_1967.head()

Out[10]:

		Rnd	Pick	Tm		Pos	Age	To	AP1	PB	St	...	TD	Rec	Yds	TD	Sk	College/Univ		Player_NFL_Link
0	1	1	BAL	Bubba Smith	DE	22	1976	1	2	6	...						Michigan St.	College Stats	/players/S/SmitBu00.htm	http://www.sports-reference.com/cfb/players/bu ...
1	1	2	MIN	Clint Jones	RB	22	1973	0	0	2	...	20	38	431	0		Michigan St.	College Stats	/players/J/JoneCl00.htm	http://www.sports-reference.com/cfb/players/cl ...
2	1	3	SFO	Steve Spurrier	QB	22	1976	0	0	6	...	2					Florida	College Stats	/players/S/SpurSt00.htm	http://www.sports-reference.com/cfb/players/st ...
3	1	4	MIA	Bob Griese	QB	22	1980	2	8	12	...	7					Purdue	College Stats	/players/G/GrieBo00.htm	http://www.sports-reference.com/cfb/players/bo ...
4	1	5	HOU	George Webster	LB	21	1976	3	3	6	...					5	Michigan St.	College Stats	/players/W/WebsGe00.htm	http://www.sports-reference.com/cfb/players/ge ...

5 rows × 30 columns

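Scraping the Data for All Seasons Since 1967

Scraping all of the draft data since 1967 is essentially the same process as above, just repeated for each draft year with a for loop.

As we loop over the years, we'll create a DataFrame for each draft and append it to a big list of DataFrames that holds all of the drafts. We'll also keep a separate list containing any errors along with the URL associated with each error. That will tell us whether our scraper ran into any problems and which URL caused them. We'll also add an extra column for tackles (Tkl). Tackles only show up after the 1993 season, so this is a column we need to insert into the DataFrames we create for the drafts from 1967 to 1993.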

In [11]:


    # Create an empty list that will contain all the dataframes
    # (one dataframe for each draft)
    draft_dfs_list = []

    # a list to store any errors that may come up while scraping
    errors_list = []

In [12]:


    # The url template that we pass the draft year into
    url_template = "http://www.pro-football-reference.com/years/{year}/draft.htm"

    # for each year from 1967 to (and including) 2016
    for year in range(1967, 2017): 

        # Use try/except block to catch and inspect any urls that cause an error
        try:
            # get the draft url
            url = url_template.format(year=year)

            # get the html
            html = urlopen(url)

            # create the BeautifulSoup object
            soup = BeautifulSoup(html, "lxml") 

            # get the column headers
            column_headers = [th.getText() for th in 
                              soup.findAll('tr', limit=2)[1].findAll('th')]
            column_headers.extend(["Player_NFL_Link", "Player_NCAA_Link"])

            # select the data from the table using the '#drafts tr' CSS selector
            table_rows = soup.select("#drafts tr")[2:] 

            # extract the player data from the table rows
            player_data = extract_player_data(table_rows)

            # create the dataframe for the current years draft
            year_df = pd.DataFrame(player_data, columns=column_headers)

            # if it is a draft from before 1994 then add a Tkl column at the 
            # 24th position
            if year < 1994:
                year_df.insert(24, "Tkl", "")

            # add the year of the draft to the dataframe
            year_df.insert(0, "Draft_Yr", year)

            # append the current dataframe to the list of dataframes
            draft_dfs_list.append(year_df)

        except Exception as e:
            # Store the url and the error it causes in a list
            error =[url, e] 
            # then append it to the list of errors
            errors_list.append(error)

In [13]:


    len(errors_list)

Out[13]:


    0

In [14]:


    errors_list

Out[14]:

    []

No errors, nice.

Now we can concatenate all of the scraped DataFrames into one large DataFrame that contains every draft.

In [15]:


    # store all drafts in one DataFrame
    draft_df = pd.concat(draft_dfs_list, ignore_index=True)
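The reason the Tkl column is inserted at a fixed position becomes clearer on a toy case. The sketch below uses two tiny, made-up frames (the column set and values are hypothetical) to show insert followed by concat. Once the older frame has Tkl in the same slot, both frames share the same column order, which matters later when columns are renamed by position.

```python
import pandas as pd

# Hypothetical "old" draft without a Tkl column and "new" draft with one
old_draft = pd.DataFrame({"Player": ["Player A"], "Rec": ["10"], "Int": ["2"]})
new_draft = pd.DataFrame({"Player": ["Player B"], "Rec": ["5"],
                          "Tkl": ["40"], "Int": ["1"]})

# Give the older draft an empty Tkl column at the same position (index 2)
old_draft.insert(2, "Tkl", "")

# Tag each frame with its draft year as the first column
old_draft.insert(0, "Draft_Yr", 1990)
new_draft.insert(0, "Draft_Yr", 1995)

# Stack the frames and renumber the rows from 0
all_drafts = pd.concat([old_draft, new_draft], ignore_index=True)

print(all_drafts)
```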

In [16]:


    # Take a look at the first few rows
    draft_df.head()

Out[16]:

		Draft_Yr	Rnd	Pick	Tm		Pos	Age	To	AP1	PB	...	Rec	Yds	Tkl	Sk	College/Univ		Player_NFL_Link
0	1967	1	1	BAL	Bubba Smith	DE	22	1976	1	2	...					Michigan St.	College Stats	/players/S/SmitBu00.htm	http://www.sports-reference.com/cfb/players/bu ...
1	1967	1	2	MIN	Clint Jones	RB	22	1973	0	0	...	38	431	0		Michigan St.	College Stats	/players/J/JoneCl00.htm	http://www.sports-reference.com/cfb/players/cl ...
2	1967	1	3	SFO	Steve Spurrier	QB	22	1976	0	0	...					Florida	College Stats	/players/S/SpurSt00.htm	http://www.sports-reference.com/cfb/players/st ...
3	1967	1	4	MIA	Bob Griese	QB	22	1980	2	8	...					Purdue	College Stats	/players/G/GrieBo00.htm	http://www.sports-reference.com/cfb/players/bo ...
4	1967	1	5	HOU	George Webster	LB	21	1976	3	3	...				5	Michigan St.	College Stats	/players/W/WebsGe00.htm	http://www.sports-reference.com/cfb/players/ge ...

5 rows × 32 columns

Since there are some duplicate column headers, and some are even empty strings, we should edit the columns a bit.

In [17]:


    # get the current column headers from the dataframe as a list
    column_headers = draft_df.columns.tolist()

    # The 5th column header is an empty string, but represents player names
    column_headers[4] = "Player"

    # Prepend "Rush_" for the columns that represent rushing stats 
    column_headers[19:22] = ["Rush_" + col for col in column_headers[19:22]]

    # Prepend "Rec_" for the columns that represent receiving stats
    column_headers[23:25] = ["Rec_" + col for col in column_headers[23:25]]

    # Properly label the defensive int column as "Def_Int"
    column_headers[-6] = "Def_Int"

    # Just use "College" as the column header for the player's college or univ
    column_headers[-4] = "College"

    # Take a look at the updated column headers
    column_headers

Out[17]:


    ['Draft_Yr',
     'Rnd',
     'Pick',
     'Tm',
     'Player',
     'Pos',
     'Age',
     'To',
     'AP1',
     'PB',
     'St',
     'CarAV',
     'DrAV',
     'G',
     'Cmp',
     'Att',
     'Yds',
     'TD',
     'Int',
     'Rush_Att',
     'Rush_Yds',
     'Rush_TD',
     'Rec',
     'Rec_Yds',
     'Rec_TD',
     'Tkl',
     'Def_Int',
     'Sk',
     'College',
     '',
     'Player_NFL_Link',
     'Player_NCAA_Link']

In [18]:


    # Now assign edited columns to the DataFrame
    draft_df.columns = column_headers

Now that the columns are in order, let's write the raw data out to a CSV file.

In [19]:


    # Write out the raw draft data to the raw_data folder in the data folder
    draft_df.to_csv("data/raw_data/pfr_nfl_draft_data_RAW.csv", index=False)

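Cleaning the Data

Now that we have the raw draft data, we need to clean it up a bit before we can do the data exploration we want.

Creating a Player ID/Links DataFrame

First, let's create a separate DataFrame that contains each player's name, their player page links, and their player ID on Pro-Football-Reference. This way, we have a separate CSV file that holds just the information needed to pull individual player data from Pro-Football-Reference at some point in the future.

To extract the Pro-Football-Reference player ID from the player link, we'll need a regular expression. A regular expression is a sequence of characters used to match a pattern in a body of text. The regular expression we can use to match the player link and extract the ID is: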


    /.*/.*/(.*)\.

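The regular expression above essentially says: match a string with the following pattern:

A '/'.
Followed by 0 or more characters (represented by '.*').
Followed by another '/' (the second '/').
Followed by 0 or more characters (again, '.*').
Followed by another (third) '/'.
Followed by a group of 0 or more characters ('(.*)').
- This is the key part of our regular expression. The '()' creates a group around the characters we want to extract. Since the player ID sits between the third '/' and the '.', we use '(.*)' to capture all of the characters found in that part of the string.
Followed by a '.', the character right after the player ID.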
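To check the pattern on a single link before applying it to the whole column, here is a small sketch using Python's built-in re module. The link is Bubba Smith's player link from the scraped table. The variable names are just for illustration.

```python
import re

# Same pattern as above, written as a raw string so "\." is passed through as-is
player_id_pattern = re.compile(r"/.*/.*/(.*)\.")

link = "/players/S/SmitBu00.htm"
match = player_id_pattern.search(link)

# group(1) is the text captured by the parentheses
player_id = match.group(1)
print(player_id)  # SmitBu00

# An empty link (a player without a page) produces no match at all
no_match = player_id_pattern.search("")
print(no_match)  # None
```

Rows with an empty link produce no match, which is consistent with Player_ID having fewer non-null values than the other columns.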

We can extract the IDs by passing the regular expression above to the pandas extract method.

In [20]:


    # extract the player id from the player links
    # expand=False returns the IDs as a pandas Series
    player_ids = draft_df.Player_NFL_Link.str.extract(r"/.*/.*/(.*)\.",
                                                      expand=False)

In [21]:


    # add a Player_ID column to our draft_df
    draft_df["Player_ID"] = player_ids

In [22]:


    # add the beginning of the pfr url to the player link column
    pfr_url = "http://www.pro-football-reference.com"
    draft_df.Player_NFL_Link =  pfr_url + draft_df.Player_NFL_Link

Now we can save a DataFrame that contains just the player names, IDs, and links.

In [23]:


    # Get the Player name, IDs, and links
    player_id_df = draft_df.loc[:, ["Player", "Player_ID", "Player_NFL_Link", 
                                    "Player_NCAA_Link"]]
    # Save them to a CSV file
    player_id_df.to_csv("data/clean_data/pfr_player_ids_and_links.csv",
                        index=False)

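Cleaning Up the Rest of the Draft Data

Now that we're done with the player IDs, let's get back to the draft data.

First, let's drop some unnecessary columns.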

In [24]:


    # drop the player links and the column labeled by an empty string
    draft_df.drop(draft_df.columns[-4:-1], axis=1, inplace=True)

The main issue left with the rest of the draft data is converting everything to the correct data type.

In [25]:


    draft_df.info()


    Int64Index: 15845 entries, 0 to 15844
    Data columns (total 30 columns):
    Draft_Yr     15845 non-null int64
    Rnd          15845 non-null object
    Pick         15845 non-null object
    Tm           15845 non-null object
    Player       15845 non-null object
    Pos          15845 non-null object
    Age          15845 non-null object
    To           15845 non-null object
    AP1          15845 non-null object
    PB           15845 non-null object
    St           15845 non-null object
    CarAV        15845 non-null object
    DrAV         15845 non-null object
    G            15845 non-null object
    Cmp          15845 non-null object
    Att          15845 non-null object
    Yds          15845 non-null object
    TD           15845 non-null object
    Int          15845 non-null object
    Rush_Att     15845 non-null object
    Rush_Yds     15845 non-null object
    Rush_TD      15845 non-null object
    Rec          15845 non-null object
    Rec_Yds      15845 non-null object
    Rec_TD       15845 non-null object
    Tkl          15845 non-null object
    Def_Int      15845 non-null object
    Sk           15845 non-null object
    College      15845 non-null object
    Player_ID    11416 non-null object
    dtypes: int64(1), object(29)
    memory usage: 3.7+ MB

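From the above we can see that a lot of the player data isn't numeric when it should be. To convert all of the columns to their proper numeric type, we can apply the to_numeric function to the whole DataFrame. Because some columns (e.g. Player, Tm, etc.) can't be converted to a numeric type (they aren't numbers), we need to set the errors parameter to "ignore" to avoid raising any errors.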

In [26]:


    # convert the data to proper numeric types
    draft_df = draft_df.apply(pd.to_numeric, errors="ignore")
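Newer pandas releases deprecate errors="ignore" for to_numeric (as far as I know, starting around version 2.2). A possible alternative, sketched below on a small made-up frame, is to wrap the conversion in a helper that returns the column unchanged when it can't be parsed. The helper name and the sample data are hypothetical, not part of the original notebook.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column as a numeric dtype, or unchanged if parsing fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

# Made-up sample: all values start out as strings, like the scraped data
sample = pd.DataFrame(
    {"Pick": ["1", "2"], "Tm": ["BAL", "MIN"], "Age": ["22", ""]},
    dtype=object,
)

# apply calls the helper once per column
converted = sample.apply(to_numeric_if_possible)

# Pick becomes an integer column, Tm stays text, and the empty Age value
# is parsed as missing, which makes Age a float column
print(converted.dtypes)
```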

In [27]:


    draft_df.info()


    Int64Index: 15845 entries, 0 to 15844
    Data columns (total 30 columns):
    Draft_Yr     15845 non-null int64
    Rnd          15845 non-null int64
    Pick         15845 non-null int64
    Tm           15845 non-null object
    Player       15845 non-null object
    Pos          15845 non-null object
    Age          11297 non-null float64
    To           10995 non-null float64
    AP1          15845 non-null int64
    PB           15845 non-null int64
    St           15845 non-null int64
    CarAV        10995 non-null float64
    DrAV         9571 non-null float64
    G            10962 non-null float64
    Cmp          1033 non-null float64
    Att          1033 non-null float64
    Yds          1033 non-null float64
    TD           1033 non-null float64
    Int          1033 non-null float64
    Rush_Att     2776 non-null float64
    Rush_Yds     2776 non-null float64
    Rush_TD      2776 non-null float64
    Rec          3395 non-null float64
    Rec_Yds      3395 non-null float64
    Rec_TD       3395 non-null float64
    Tkl          3644 non-null float64
    Def_Int      2590 non-null float64
    Sk           2670 non-null float64
    College      15845 non-null object
    Player_ID    11416 non-null object
    dtypes: float64(19), int64(6), object(5)
    memory usage: 3.7+ MB

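We're not done yet. A lot of the numeric columns are missing data because players didn't accumulate any of those stats. For example, some players never scored a TD, or never even played a game. Let's select the columns with numeric data and replace the NaNs (the current value representing missing data) with 0, since that is a more appropriate value.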

In [28]:


    # Get the column names for the numeric columns
    num_cols = draft_df.columns[draft_df.dtypes != object]

    # Replace all NaNs with 0
    draft_df.loc[:, num_cols] = draft_df.loc[:, num_cols].fillna(0)
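Another way to pick out the numeric columns is select_dtypes, which asks for numeric columns directly instead of excluding the object dtype. The sketch below shows the idea on a tiny made-up frame. The column names mirror the draft data but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values in both numeric and text columns
sample = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "G": [12.0, np.nan],
    "TD": [np.nan, 3.0],
    "Player_ID": ["PlayA00", None],
})

# Names of the columns that hold numbers
num_cols = sample.select_dtypes(include="number").columns

# Fill missing values with 0 in the numeric columns only
sample[num_cols] = sample[num_cols].fillna(0)

print(sample)
```

The missing Player_ID is left alone, since 0 would not be a meaningful ID.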

In [29]:


    # Everything is filled, except for Player_ID, which is fine for now
    draft_df.info()


    Int64Index: 15845 entries, 0 to 15844
    Data columns (total 30 columns):
    Draft_Yr     15845 non-null int64
    Rnd          15845 non-null int64
    Pick         15845 non-null int64
    Tm           15845 non-null object
    Player       15845 non-null object
    Pos          15845 non-null object
    Age          15845 non-null float64
    To           15845 non-null float64
    AP1          15845 non-null int64
    PB           15845 non-null int64
    St           15845 non-null int64
    CarAV        15845 non-null float64
    DrAV         15845 non-null float64
    G            15845 non-null float64
    Cmp          15845 non-null float64
    Att          15845 non-null float64
    Yds          15845 non-null float64
    TD           15845 non-null float64
    Int          15845 non-null float64
    Rush_Att     15845 non-null float64
    Rush_Yds     15845 non-null float64
    Rush_TD      15845 non-null float64
    Rec          15845 non-null float64
    Rec_Yds      15845 non-null float64
    Rec_TD       15845 non-null float64
    Tkl          15845 non-null float64
    Def_Int      15845 non-null float64
    Sk           15845 non-null float64
    College      15845 non-null object
    Player_ID    11416 non-null object
    dtypes: float64(19), int64(6), object(5)
    memory usage: 3.7+ MB

Finally, we're done cleaning the data, and we can save it to a CSV file.

In [30]:


    draft_df.to_csv("data/clean_data/pfr_nfl_draft_data_CLEAN.csv", index=False)

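Exploring the NFL Draft

Now that we're done getting and cleaning the data we want, we can finally have some fun. First, let's keep only the draft data up to and including the 2010 draft, since players drafted more recently haven't accumulated enough data to have a properly representative career Approximate Value (or cAV).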

In [31]:


    # get data for drafts from 1967 to 2010
    # (.copy() so that later edits don't act on a view of draft_df)
    draft_df_2010 = draft_df.loc[draft_df.Draft_Yr <= 2010, :].copy()

In [32]:


    draft_df_2010.tail() # we see that the last draft is 2010

Out[32]:

		Draft_Yr	Rnd	Pick	Tm	Player	Pos	Age	PB	...	Rush_TD	Rec	Rec_Yds	Rec_TD	Tkl	Def_Int	Sk	College
14314	2010	7	251	OAK	Stevie Brown	DB	23.0	2014.0	...	0.0	0.0	0.0	0.0	98.0	8.0	1.0	Michigan	BrowSt99
14315	2010	7	252	MIA	Austin Spitler	LB	23.0	2013.0	...	0.0	0.0	0.0	0.0	11.0	0.0	0.0	Ohio St.	SpitAu99
14316	2010	7	253	TAM	Erik Lorig	DE	23.0	2014.0	...	4.0	39.0	220.0	2.0	3.0	0.0	0.0	Stanford	LoriEr99
14317	2010	7	254	STL	Josh Hull	LB	23.0	2013.0	...	0.0	0.0	0.0	0.0	11.0	0.0	0.0	Penn St.	HullJo99
14318	2010	7	255	DET	Tim Toone	WR	25.0	2012.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	Weber St.	ToonTi00

5 rows × 30 columns

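The Distribution of Career Approximate Value

Using seaborn's distplot function, we can quickly see what the cAV distribution looks like, as both a histogram and a kernel density estimate.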

In [33]:


    # set some plotting styles
    from matplotlib import rcParams

    # set the font scaling and the plot sizes
    sns.set(font_scale=1.65)
    rcParams["figure.figsize"] = 12,9

In [34]:


    # Use distplot to view the distribution
    sns.distplot(draft_df_2010.CarAV)
    plt.title("Distribution of Career Approximate Value")
    plt.xlim(-5,150)
    plt.show()

We can also view the distribution by position with the boxplot function.

In [35]:


    sns.boxplot(x="Pos", y="CarAV", data=draft_df_2010)
    plt.title("Distribution of Career Approximate Value by Position (1967-2010)")
    plt.show()

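From the two plots above, we see that most players don't end up doing much in their NFL careers, as most players hover around the 0-10 cAV range.

There are also some positions whose entire distribution sits at 0 cAV, or is very low (and narrow). We can see from the value counts below that this is probably because very few players carry those position labels.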

In [36]:


    # Look at the counts for each position
    draft_df_2010.Pos.value_counts()

Out[36]:


    DB    2456
    LB    1910
    RB    1686
    WR    1636
    DE    1130
    T     1091
    G      959
    DT     889
    TE     802
    QB     667
    C      425
    K      187
    P      150
    NT     127
    FB      84
    FL      63
    E       29
    HB      23
    KR       3
    WB       2
    Name: Pos, dtype: int64

Let's drop those positions and merge the "HB" players with the "RB" players.

In [37]:


    # drop players from the following positions [FL, E, WB, KR]
    drop_idx = ~ draft_df_2010.Pos.isin(["FL", "E", "WB", "KR"])

    draft_df_2010 = draft_df_2010.loc[drop_idx, :]

In [38]:


    # Now replace HB label with RB label
    draft_df_2010.loc[draft_df_2010.Pos == "HB", "Pos"] = "RB"

Let's take another look at the distribution by position.

In [39]:


    sns.boxplot(x="Pos", y="CarAV", data=draft_df_2010)
    plt.title("Distribution of Career Approximate Value by Position (1967-2010)")
    plt.show()

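Fitting a Draft Curve

Now we can fit a curve to look at the cAV at each pick. We'll fit the curve using local regression, which "travels" along the data, fitting a curve to small chunks of it at a time.

seaborn lets us plot a Lowess curve very easily by using regplot and setting the lowess parameter to True.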

In [40]:


    # plot LOWESS curve
    # set line color to be black, and scatter color to cyan
    sns.regplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True,
                line_kws={"color": "black"},
                scatter_kws={"color": sns.color_palette()[5], "alpha": 0.5})
    plt.title("Career Approximate Value by Pick")
    plt.xlim(-5, 500)
    plt.ylim(-5, 200)
    plt.show()
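seaborn relies on statsmodels for its lowess option, so the smoothed values can also be computed directly when the numbers are needed rather than just the picture. The sketch below does that on made-up data that decays with pick number. The frac value of 0.3 is an arbitrary choice for illustration and not necessarily what seaborn uses.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Made-up values that decay with pick number, plus noise
rng = np.random.default_rng(0)
picks = np.arange(1, 201)
fake_car_av = 40 * np.exp(-picks / 60) + rng.normal(0, 3, size=picks.size)

# lowess takes y first, then x; frac is the share of points in each local fit
smoothed = lowess(fake_car_av, picks, frac=0.3)

# The result has one row per point: column 0 is x (sorted), column 1 is fitted y
smooth_x, smooth_y = smoothed[:, 0], smoothed[:, 1]

print(smooth_y[:5])
```

A smaller frac follows the data more closely, while a larger one gives a smoother curve.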

We can also fit a Lowess curve for each position by using lmplot and setting hue to "Pos".

In [41]:


    # Fit a LOWESS curve for each position
    # (height was called size in seaborn versions before 0.9)
    sns.lmplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True, hue="Pos",
               height=10, scatter=False)
    plt.title("Career Approximate Value by Pick and Position")
    plt.xlim(-5, 500)
    plt.ylim(-1, 60)
    plt.show()

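The plot above is a bit too cluttered because of all the lines. We can separate the curves and plot each position's curve individually. To do this, instead of setting hue to "Pos", we set col to "Pos". To organize the plots into a 5x3 grid, we set col_wrap to 5.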

In [44]:


    lm = sns.lmplot(x="Pick", y="CarAV", data=draft_df_2010, lowess=True, col="Pos",
                    col_wrap=5, height=4, line_kws={"color": "black"},
                    scatter_kws={"color": sns.color_palette()[5], "alpha": 0.7})

    # add title to the plot (which is a FacetGrid)
    # https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot 
    plt.subplots_adjust(top=0.9)
    lm.fig.suptitle("Career Approximate Value by Pick and Position",
                    fontsize=30)

    plt.xlim(-5, 500)
    plt.ylim(-1, 100)
    plt.show()