I am scraping Baseball Reference for a data science project, and have come across an issue when trying to scrape player data from a specific league, one that just started playing this season. When I scrape old leagues that have already finished playing I have no issues, but I want to scrape the league at this link: https://www.baseball-reference.com/register/league.cgi?id=c346199a live as the season goes on. However, the links are hidden behind a lot of what seems to be plain text, so BeautifulSoup's find_all('a', href=True) does not work.
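To illustrate what I mean, here is a minimal reproduction using made-up markup that mirrors what the page source looks like (an anchor wrapped in an HTML comment):

```python
from bs4 import BeautifulSoup

# Made-up markup: the anchor sits inside an HTML comment, so to the
# parser it is just a comment node, not a tag.
html = '<div><!-- <a href="/register/team.cgi?id=71fe19cd">Team</a> --></div>'
soup = BeautifulSoup(html, 'html.parser')

# The usual link search comes up empty.
print(soup.find_all('a', href=True))  # -> []
```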
So instead here is what my thought process has been so far.
html = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/league.cgi?id=c346199a').text, features = 'html.parser').find_all('div')
ind = [str(div) for div in html][0]
orig_ind = ind[ind.find('/register/team.cgi?id='):]
count = orig_ind.count('/register/team.cgi?id=')
team_links = []
for i in range(count):
    # rn finds the same one over and over
    link = orig_ind[orig_ind.find('/register/team.cgi?id='):orig_ind.find('title')].strip().replace('"', '')
    # try to remove it from orig_ind and do the next link...
    # this is the part that is not working rn
    orig_ind = orig_ind.replace(link, '')
    team_links.append('https://baseball-reference.com' + link)
Which outputs:
['https://baseball-reference.com/register/team.cgi?id=71fe19cd',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
and so on. I am trying to get all of the team links from this page: https://www.baseball-reference.com/register/league.cgi?id=c346199a
and then crawl over to the player links on each of those pages and collect some data. Like I said, it works on pretty much every league I have tried except for this one.
Any help is greatly appreciated.
Comments (1)
The tables you see on this site are stored inside HTML comments (<!-- ... -->), so BeautifulSoup normally doesn't see them. To parse them, extract the comment nodes and feed their contents back through the parser.
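A sketch of that approach (the original answer's code was not preserved, so the sample markup below is invented; in practice you would pass requests.get(url).text instead of the literal string):

```python
from bs4 import BeautifulSoup, Comment

# Invented sample mimicking the page: the team table is wrapped in an
# HTML comment, so a plain find_all('a') never sees its links.
html = '''
<div id="all_teams">
<!--
<table><tbody>
<tr><td><a href="/register/team.cgi?id=71fe19cd">Team A</a></td></tr>
<tr><td><a href="/register/team.cgi?id=deadbeef">Team B</a></td></tr>
</tbody></table>
-->
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

team_links = []
# Find every comment node, re-parse its contents as HTML, and pull out
# the team links hidden inside.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    inner = BeautifulSoup(comment, 'html.parser')
    for a in inner.find_all('a', href=True):
        if a['href'].startswith('/register/team.cgi?id='):
            team_links.append('https://www.baseball-reference.com' + a['href'])

print(team_links)
```

Re-parsing the comment text works because a Comment is just a string subclass holding everything between the <!-- and --> markers, which here is ordinary HTML.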