How to scrape a list of image URLs from college football data by class_name using Selenium
I'm trying to scrape player data from college football roster sites. I am primarily interested in getting each player's image, weight, and name. I have already been able to extract the weight and name, but I am struggling to extract the image using Selenium. This is my code so far.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("C:/Users/<my_user>/Downloads/chromedriver.exe")
driver.get(school["url"])
# Each call returns a list of WebElements whose class matches the given name.
all_names = driver.find_elements(by=By.CLASS_NAME, value='sidearm-roster-player-name')
all_weights = driver.find_elements(by=By.CLASS_NAME, value='sidearm-roster-player-weight')
all_imgs = driver.find_elements(by=By.CLASS_NAME, value='sidearm-roster-player-image')
This is an example of what would be passed in as school["url"]. Many colleges use this format.
https://rolltide.com/sports/football/roster
Each player on this site has the following HTML element.
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/football/roster/jeremiah-alexander/8141" aria-label="Jeremiah Alexander - View Full Bio" title="View Full Bio">
<img class=" lazyloaded" data-src="https://d1a8hwz3c6qyrc.cloudfront.net/images/2022/3/1/Alexander_Jeremiah.jpg?width=80" alt="Jeremiah Alexander" src="https://d1a8hwz3c6qyrc.cloudfront.net/images/2022/3/1/Alexander_Jeremiah.jpg?width=80">
</a>
</div>
The issue I am running into is that each WebElement in all_imgs does not seem to have an attribute called 'img', or any other attribute I can see that represents the image link located within the element. How can I get the links to all of the player images on this page?
Comments (3)
Actually... you can get all the data without selenium. Everything is in the source HTML. Here's how to do this:
This should output:
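The listing and output for this answer are not included above. As a rough illustration of the idea only (not the answerer's code), here is a minimal sketch that pulls image URLs out of raw HTML with Python's standard-library html.parser. The names RosterImageParser and extract_image_urls are hypothetical, the sample markup is trimmed from the snippet in the question, and whether the live page's source matches that snippet is an assumption.

```python
from html.parser import HTMLParser


class RosterImageParser(HTMLParser):
    """Collect image URLs found inside roster image containers (hypothetical helper)."""

    def __init__(self, container_class="sidearm-roster-player-image"):
        super().__init__()
        self.container_class = container_class
        self.depth = 0  # > 0 while we are inside a matching <div>
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so inner <div>s do not end the container early.
            if self.depth or self.container_class in classes:
                self.depth += 1
        elif tag == "img" and self.depth:
            # Lazy-loaded images may keep the real URL in data-src.
            url = attrs.get("data-src") or attrs.get("src")
            if url:
                self.image_urls.append(url)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def extract_image_urls(html):
    """Return every image URL found inside a roster image container."""
    parser = RosterImageParser()
    parser.feed(html)
    return parser.image_urls


# Markup trimmed from the question's example element.
sample_html = """
<div class="sidearm-roster-player-image column">
<a href="/sports/football/roster/jeremiah-alexander/8141" title="View Full Bio">
<img class=" lazyloaded" data-src="https://d1a8hwz3c6qyrc.cloudfront.net/images/2022/3/1/Alexander_Jeremiah.jpg?width=80" alt="Jeremiah Alexander" src="https://d1a8hwz3c6qyrc.cloudfront.net/images/2022/3/1/Alexander_Jeremiah.jpg?width=80">
</a>
</div>
"""

print(extract_image_urls(sample_html))
```

For a real roster page, the html string would come from an HTTP request (for example urllib.request or requests) instead of sample_html.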
To create a list with all the values of the src attribute, you can use a list comprehension with either of the following locator strategies:
Using CSS_SELECTOR:
Using XPATH:
Note: You have to add the following imports:
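The snippets for this answer are not shown above, so the following is only a sketch of what such a list comprehension might look like. The two locator strings are assumptions derived from the HTML in the question, and image_srcs is a hypothetical helper name. The try/except fallback exists only so the sketch can be read and run without Selenium installed; with Selenium present, the real By class is used.

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:
    # Fallback for environments without Selenium; mirrors the string
    # values of the real By constants.
    class By:
        CSS_SELECTOR = "css selector"
        XPATH = "xpath"

# Assumed locators: any <img> nested inside the roster image <div>.
CSS_LOCATOR = (By.CSS_SELECTOR, "div.sidearm-roster-player-image img")
XPATH_LOCATOR = (
    By.XPATH,
    "//div[contains(@class, 'sidearm-roster-player-image')]//img",
)


def image_srcs(driver, locator=CSS_LOCATOR):
    """Return the src attribute of every element matched by the locator."""
    by, value = locator
    return [img.get_attribute("src") for img in driver.find_elements(by, value)]
```

After driver.get(school["url"]), calling image_srcs(driver) or image_srcs(driver, XPATH_LOCATOR) would return the list of URLs. Because the images are lazy-loaded, reading "data-src" instead of "src" may be more reliable on some pages.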
Accept baduker's solution, as it's correct. I am just adding to it to answer your follow-up.
The reason the 2020 roster throws an error is that some of the players do not have an image URL. There are a few ways you could handle that. I simply took the JSON that baduker parses and, prior to creating player_data from it, filled in None for the players missing an image URL:
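The code for this step is not shown above. A minimal sketch of the idea follows, assuming the parsed JSON is a list of dicts in which each player may or may not carry an "image" entry with a "url" key. Those field names, the sample records, and the helper name fill_missing_images are all hypothetical.

```python
def fill_missing_images(players):
    """Return copies of the player records where image["url"] always exists."""
    filled = []
    for player in players:
        record = dict(player)
        # Treat a missing or null "image" entry as an empty dict.
        image = dict(record.get("image") or {})
        image.setdefault("url", None)
        record["image"] = image
        filled.append(record)
    return filled


# Hypothetical records standing in for the parsed roster JSON.
roster = [
    {"name": "Player A", "image": {"url": "https://example.com/a.jpg"}},
    {"name": "Player B"},
    {"name": "Player C", "image": None},
]

player_data = [(p["name"], p["image"]["url"]) for p in fill_missing_images(roster)]
print(player_data)
```

Normalizing the records first means the later lookup p["image"]["url"] cannot raise a KeyError or TypeError for players without a photo.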