Trouble differentiating URLs when web scraping data from Baseball-Reference
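def getURL(playerName):
    # Build the player page URL from the pattern:
    # first 5 letters of the last name + first 2 letters of the first name + '01'
    begURL = 'https://www.baseball-reference.com/players/'
    names = playerName.split()
    letter = names[1][0].lower()  # players are grouped by last-name initial
    midURL = begURL + letter + '/'
    lastAbr = names[1][0:5].lower()
    firstAbr = names[0][0:2].lower()
    # The '01' suffix is hard-coded, so namesakes ('02', '03', ...) are never reached
    URL = midURL + lastAbr + firstAbr + '01.shtml'
    return URL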
This code is used to get the URL from Baseball-Reference based on a user's input of a player name. Sometimes this doesn't work, because there are multiple people with the same name or their name has changed. I am trying to allow users to compare MLB players and choose either pitcher or batter.
For example, if you input Giancarlo Stanton, there is an error because his name was Mike Stanton in the first two years of his career. While that is a minor problem that will not happen often, there are a lot of players who have the same name, and therefore the URL should change at the end to 02 or 03... The scrape only works for getting a small amount of stats from 2022, so if it is an old player, the scrape doesn't work and there is an error.
Is there an easier way to account for this besides coding in everyone's special URLs if they do not follow the basic URL pattern?
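To make the namesake problem concrete, here is a minimal sketch that builds every candidate URL for a name instead of only the '01' one. The function name candidate_urls and the max_suffix parameter are made up for illustration. The sketch only enumerates candidates from the same URL pattern as getURL; the caller would still have to check which page belongs to the intended player, and it does nothing for name changes such as Giancarlo Stanton.

```python
BASE_URL = 'https://www.baseball-reference.com/players/'


def candidate_urls(player_name, max_suffix=3):
    """Return the URLs that the naming pattern could map this name to.

    Hypothetical helper: the same pattern as getURL, but with the numeric
    suffix varied from 01 up to max_suffix instead of being fixed at 01.
    """
    names = player_name.split()
    if len(names) < 2:
        raise ValueError('expected a first and a last name')
    first, last = names[0].lower(), names[1].lower()
    stem = last[:5] + first[:2]  # e.g. 'stant' + 'mi'
    return [
        f'{BASE_URL}{last[0]}/{stem}{n:02d}.shtml'
        for n in range(1, max_suffix + 1)
    ]


if __name__ == '__main__':
    for url in candidate_urls('Mike Stanton'):
        print(url)
```

Each candidate could then be requested in turn and the page inspected (for example for 2022 stats) before deciding which one to scrape.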
Comments (1)
The best way to do it is to use the player register and look players up by their player IDs. You can get the player IDs from sup_players_search_list.csv. Then it's just a matter of picking the guy you want.
I added a couple of packages, fuzzywuzzy and choice, to do that. This will also help in the event of typos, misspellings, etc.
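A minimal sketch of the kind of lookup described above, not the answerer's own code. To stay dependency-free it swaps in the standard library's difflib for fuzzywuzzy, and it reads a tiny in-memory sample instead of downloading sup_players_search_list.csv. The two-column layout (player ID, then name) and the helper names load_register and find_player_urls are assumptions for illustration; check the real file's columns before relying on this.

```python
import csv
import io
from difflib import get_close_matches

BASE_URL = 'https://www.baseball-reference.com/players/'

# Illustrative rows only. The real register file's column layout is an
# assumption here: player ID first, display name second.
SAMPLE_REGISTER = """stantmi03,Giancarlo Stanton
judgeaa01,Aaron Judge
troutmi01,Mike Trout
"""


def load_register(text):
    """Map lower-cased player name -> list of player IDs with that name."""
    register = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        player_id, name = row[0], row[1]
        register.setdefault(name.lower(), []).append(player_id)
    return register


def find_player_urls(query, register, n=3, cutoff=0.6):
    """Fuzzy-match a typed name and return (name, url) pairs to choose from."""
    matches = get_close_matches(query.lower(), list(register), n=n, cutoff=cutoff)
    results = []
    for name in matches:
        for player_id in register[name]:
            # Player pages are grouped by the first letter of the ID
            url = f'{BASE_URL}{player_id[0]}/{player_id}.shtml'
            results.append((name, url))
    return results


if __name__ == '__main__':
    register = load_register(SAMPLE_REGISTER)
    # Note the typo in the query; the fuzzy match still finds the player
    for name, url in find_player_urls('Giancarlo Staton', register):
        print(name, url)
```

Because the URL is built from the player ID rather than from the typed name, namesakes and renamed players stop being special cases: several IDs under one name simply become several options for the user to pick from.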