Trouble differentiating player URLs when web scraping data from Baseball-Reference

Posted on 2025-02-02 05:34:00 · 670 words · 4 views · 0 comments

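def getURL(playerName):
    # Build a Baseball-Reference URL from a "First Last" player name
    begURL = 'https://www.baseball-reference.com/players/'
    names = playerName.split()
    letter = names[1][0].lower()        # first letter of the last name
    midURL = begURL + letter + '/'
    lastAbr = names[1][0:5].lower()     # first five letters of the last name
    firstAbr = names[0][0:2].lower()    # first two letters of the first name
    URL = midURL + lastAbr + firstAbr + '01.shtml'  # always assumes the 01 suffix
    return URL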

This code builds a Baseball-Reference URL from a player name entered by the user. Sometimes it doesn't work, because several players share the same name or a player's name has changed. I am trying to let users compare MLB players and choose either pitchers or batters.

For example, if you input Giancarlo Stanton, there is an error because he went by Mike Stanton in the first two years of his career. While that is a minor problem that will not happen often, there are a lot of players who share the same name, and for them the URL should end in 02 or 03 instead of 01. The scrape also only works for getting a small amount of stats from 2022, so if it is an older player, the scrape doesn't work and there is an error.
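
To make the suffix problem concrete, here is a minimal sketch that lists the URLs the naming pattern would produce for one name. The function name `candidate_urls` and the `max_suffix` parameter are made up for illustration. The sketch only builds strings; it does not check which page, if any, belongs to which player.

```python
def candidate_urls(player_name, max_suffix=3):
    """Build candidate URLs for suffixes 01..max_suffix (hypothetical helper)."""
    base = 'https://www.baseball-reference.com/players/'
    first, last = player_name.split()[:2]
    # Same pattern as getURL: five letters of the last name, two of the first
    stem = last[:5].lower() + first[:2].lower()
    return [f'{base}{last[0].lower()}/{stem}{n:02d}.shtml'
            for n in range(1, max_suffix + 1)]


for url in candidate_urls('Mike Stanton'):
    print(url)
```

Every player whose name reduces to the same stem shares these candidates, so the name alone cannot tell you which suffix is the right one.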

Is there an easier way to account for this besides hard-coding everyone's special URL if they do not follow the basic URL pattern?

Comments (1)

忆沫 2025-02-09 05:34:00

The best way to do it is to use the player register and look players up by their player IDs. You can get the player IDs from sup_players_search_list.csv. Then it's just a matter of picking the player you want.
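
As a rough sketch of the ID lookup idea, the snippet below filters a tiny in-memory CSV by name and builds one register URL per match. The sample rows, names, and IDs are entirely made up, and `find_players` is a hypothetical helper. The only assumption about the real file is that its first three columns are id, name, and years.

```python
import csv
import io

# Hypothetical sample rows (id, name, years); these players and IDs are made up.
SAMPLE_CSV = """doejo01,John Doe,1950-1955
doejo02,John Doe,2001-2010
roeja01,Jane Roe,1998-2004
"""

BASE_URL = 'https://www.baseball-reference.com/register/player.fcgi?id='


def find_players(csv_text, name):
    """Return (id, name, years) tuples for every row whose name matches exactly."""
    found = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Only the first three columns are used; extra columns are ignored
        if len(row) >= 3 and row[1] == name:
            found.append(tuple(row[:3]))
    return found


matches = find_players(SAMPLE_CSV, 'John Doe')
for pid, pname, years in matches:
    print(f'{pname} ({years}): {BASE_URL}{pid}')
```

Two players with the same name come back as two rows, and the years column is what lets a user tell them apart.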

I added a couple of packages, fuzzywuzzy and choice, to do that. Fuzzy matching also helps in the event of typos, misspellings, etc.
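
To show what fuzzy matching buys you without installing anything, here is a small sketch that swaps in `difflib.get_close_matches` from the standard library as a stand-in for fuzzywuzzy. It is not the fuzzywuzzy API and its scores differ. `closest_names` is a hypothetical helper, and the `cutoff` value is an arbitrary choice.

```python
import difflib


def closest_names(query, names, limit=5, cutoff=0.6):
    """Return up to `limit` names ranked by similarity to `query`."""
    # Compare lower-cased strings, then map back to the original spelling
    lookup = {n.lower(): n for n in names}
    hits = difflib.get_close_matches(query.lower(), list(lookup),
                                     n=limit, cutoff=cutoff)
    return [lookup[h] for h in hits]


players = ['Giancarlo Stanton', 'Tom Stanton', 'Joe Staton', 'Carl Sitton']

# The query has a typo ("Staton"); the best match should still rank first
print(closest_names('Giancarlo Staton', players))
```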

Code:

from io import StringIO

from bs4 import BeautifulSoup
import requests
import pandas as pd

# pip install fuzzywuzzy
from fuzzywuzzy import process

# pip install choice
import choice


def askname():
    # Prompt the user for a player name
    return input("Enter the player's name -> ")


# Get all player IDs
player_df = pd.read_csv('https://www.baseball-reference.com/short/inc/sup_players_search_list.csv', header=None)
player_df = player_df.rename(columns={0: 'id',
                                      1: 'playerName',
                                      2: 'years'})
playersList = list(player_df['playerName'])


# Ask the user for a player name
playerNameInput = askname()


# Find the closest matches
search_match = pd.DataFrame(process.extract(playerNameInput, playersList))
search_match = search_match.rename(columns={0: 'playerName', 1: 'matchScore'})

matches = pd.merge(search_match, player_df, how='inner', on='playerName').drop_duplicates().reset_index(drop=True)
choices = [': '.join(x) for x in zip(matches['playerName'], matches['years'])]

# Choose the match
playerChoice = choice.Menu(choices).ask()
playerName, years = playerChoice.rsplit(': ', 1)

# Get the matched player's id
match = player_df[(player_df['playerName'] == playerName) & (player_df['years'] == years)]

baseUrl = 'https://www.baseball-reference.com/register/player.fcgi?id='
playerId = match.iloc[0]['id']

url = f'{baseUrl}{playerId}'


# Get the data; stripping the comment markers exposes tables wrapped in HTML comments
response = requests.get(url)
html = response.text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')

tables_dict = {}
tables = soup.find_all('table')
for table in tables:
    caption = table.find('caption')
    if caption is None:
        continue  # skip tables that have no caption to use as a key
    stat_type = caption.text.strip()
    # Wrap in StringIO: newer pandas versions deprecate passing literal HTML strings
    df = pd.read_html(StringIO(str(table)))[0]

    tables_dict[stat_type] = df

for tableName, table in tables_dict.items():
    print(f'\n\n*** {tableName} ***')
    print(table.to_string())
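
The two `replace` calls in the code above remove HTML comment markers, which suggests that some tables on the page sit inside comments and would otherwise be invisible to the parser. Here is a small standard-library sketch of that effect on a made-up HTML string. `TableCounter`, `count_tables`, and `SAMPLE` are hypothetical names, and the markup is not taken from the real page.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags; anything inside an HTML comment is not seen."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.count += 1


def count_tables(html):
    parser = TableCounter()
    parser.feed(html)
    return parser.count


# Made-up markup: one visible table and one wrapped in a comment
SAMPLE = '<div><table id="a"></table><!-- <table id="b"></table> --></div>'
unhidden = SAMPLE.replace('<!--', '').replace('-->', '')

print(count_tables(SAMPLE), count_tables(unhidden))
```

The trade-off is that stripping every comment marker also exposes anything else that was commented out, which is fine for pulling tables but worth keeping in mind.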

Output:

Enter the player's name -> Giancarlo Stanton
Make a choice:
 0: Giancarlo Stanton: 2010-2022
 1: Tom Stanton: 1904
 2: Carl Husta: 1925
 3: Joe Staton: 1972-1973
 4: Carl Sitton: 1909

Enter number or name; return for next page

? 0

And the data:

*** Baseball America ***
          0    1
0  Pre-2009  #16
1  Pre-2010   #3


*** Baseball Prospectus ***
          0    1
0  Pre-2009  #14
1  Pre-2010   #5


*** Futures Game ***
                   0     1
0  2009 Futures Game  U.S.


*** Register Batting ***
                       Year                      Age  ...   SF  IBB
0                      2007                       17  ...    1    0
1                      2007                       17  ...    1    0
2                      2007                       17  ...    0    0
3                      2008                       18  ...    3    7
4                      2009                       19  ...    6    1
5                      2009                       19  ...    5    1
6                      2009                       19  ...    1    0
7                      2009                       19  ...    0    0
8                      2010                       20  ...    2   10
9                      2010                       20  ...    1    6
10                     2011                       21  ...    6    6
11                     2012                       22  ...    0    0
12                     2012                       22  ...    1    9
13                     2013                       23  ...    0    0
14                     2013                       23  ...    1    5
15                     2014                       24  ...    2   24
16                     2015                       25  ...    0    0
17                     2015                       25  ...    3    6
18                     2016                       26  ...    2    5
19                     2017                       27  ...    3   13
20                     2018                       28  ...   10    5
21                     2019                       29  ...    0    0
22                     2019                       29  ...    0    0
23                     2019                       29  ...    0    0
24                     2019                       29  ...    1    0
25                     2020                       30  ...    0    1
26                     2021                       31  ...    3    1
27                     2022                       32  ...    3    2
28                     Year                      Age  ...   SF  IBB
29      Majors (13 seasons)      Majors (13 seasons)  ...   36   83
30       Minors (8 seasons)       Minors (8 seasons)  ...   12   18
31         Other (1 season)         Other (1 season)  ...    0    0
32  All Levels (16 Seasons)  All Levels (16 Seasons)  ...   48  101
33                      NaN                      NaN  ...  NaN  NaN
34           AAA (1 season)           AAA (1 season)  ...    0    0
35           AA (2 seasons)           AA (2 seasons)  ...    7   11
36           A+ (5 seasons)           A+ (5 seasons)  ...    1    0
37             A (1 season)             A (1 season)  ...    3    7
38            A- (1 season)            A- (1 season)  ...    1    0
39            Rk (1 season)            Rk (1 season)  ...    0    0

[40 rows x 30 columns]


*** Register Fielding ***
                       Year                      Age  ... lgCS% PO.1
0                      2007                       17  ...   NaN  NaN
1                      2007                       17  ...   NaN  NaN
2                      2007                       17  ...   NaN  NaN
3                      2007                       17  ...   NaN  NaN
4                      2007                       17  ...   NaN  NaN
..                      ...                      ...  ...   ...  ...
76   All Levels (3 Seasons)   All Levels (3 Seasons)  ...   NaN  NaN
77    All Levels (1 Season)    All Levels (1 Season)  ...   NaN  NaN
78   All Levels (5 Seasons)   All Levels (5 Seasons)  ...   NaN  NaN
79  All Levels (14 Seasons)  All Levels (14 Seasons)  ...   NaN  NaN
80  All Levels (15 Seasons)  All Levels (15 Seasons)  ...   NaN  NaN

[81 rows x 26 columns]


*** Teams Played For ***
    Year  Age                                Tm  ... Stint        From          To
0   2007   17                       GCL Marlins  ...   NaN  2007-08-16  2007-08-27
1   2007   17                 Jamestown Jammers  ...   NaN  2007-08-29  2007-09-07
2   2008   18           Greensboro Grasshoppers  ...   NaN  2008-04-03  2008-09-01
3   2009   19               Jupiter Hammerheads  ...   NaN  2009-04-09  2009-06-03
4   2009   19                 Jacksonville Suns  ...   NaN  2009-06-05  2009-09-07
5   2009   19                    Mesa Solar Sox  ...   NaN  2009-10-13  2009-10-22
6   2010   20                 Jacksonville Suns  ...   NaN  2010-04-08  2010-06-05
7   2010   20                   Florida Marlins  ...   1.0  2010-06-08  2010-10-03
8   2011   21                   Florida Marlins  ...   1.0  2011-04-01  2011-09-28
9   2012   22                     Miami Marlins  ...   1.0  2012-04-04  2012-10-03
10  2012   22               Jupiter Hammerheads  ...   NaN  2012-08-02  2012-08-05
11  2013   23                     Miami Marlins  ...   1.0  2013-04-01  2013-09-29
12  2013   23               Jupiter Hammerheads  ...   NaN  2013-06-04  2013-06-09
13  2014   24                     Miami Marlins  ...   1.0  2014-03-31  2014-09-11
14  2015   25                     Miami Marlins  ...   1.0  2015-04-06  2015-06-26
15  2015   25               Jupiter Hammerheads  ...   NaN  2015-09-01  2015-09-01
16  2016   26                     Miami Marlins  ...   1.0  2016-04-05  2016-10-02
17  2017   27                     Miami Marlins  ...   1.0  2017-04-03  2017-10-01
18  2018   28                  New York Yankees  ...   1.0  2018-03-29  2018-09-29
19  2019   29                  New York Yankees  ...   1.0  2019-03-28  2019-09-29
20  2019   29                     Tampa Tarpons  ...   NaN  2019-05-20  2019-06-12
21  2019   29  Scranton/Wilkes-Barre RailRiders  ...   NaN  2019-06-14  2019-06-16
22  2020   30                  New York Yankees  ...   1.0  2020-07-23  2020-09-26
23  2021   31                  New York Yankees  ...   1.0  2021-04-01  2021-10-03
24  2022   32                  New York Yankees  ...   1.0  2022-04-08  2022-05-24

[25 rows x 9 columns]
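
The Register Batting output above mixes season rows with a repeated header row (index 28) and summary rows such as "Majors (13 seasons)". If you only want per-season lines for comparing players, one possible cleanup is to keep rows whose Year is a four-digit number. This is a sketch on plain dictionaries rather than on the DataFrame itself, with a few values copied from the output above. `season_rows` is a hypothetical helper, and it assumes the values arrive as strings.

```python
# Sample rows copied from the Register Batting output (Year, Age, SF, IBB only)
rows = [
    {'Year': '2021', 'Age': '31', 'SF': '3', 'IBB': '1'},
    {'Year': '2022', 'Age': '32', 'SF': '3', 'IBB': '2'},
    {'Year': 'Year', 'Age': 'Age', 'SF': 'SF', 'IBB': 'IBB'},
    {'Year': 'Majors (13 seasons)', 'Age': 'Majors (13 seasons)', 'SF': '36', 'IBB': '83'},
]


def season_rows(rows):
    """Keep only rows whose Year field is a four-digit year."""
    return [r for r in rows if len(r['Year']) == 4 and r['Year'].isdigit()]


for row in season_rows(rows):
    print(row['Year'], row['SF'], row['IBB'])
```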