Struggling to scrape a table using the rvest package
I've recently started using R again after a long hiatus and I'm extremely rusty, especially when it comes to HTML and scraping data (with rvest).
My main issue right now is identifying the correct nodes/XPath to pass to my function so that it pulls the correct data. More specifically, I'm trying to scrape college hockey play-by-play data from the NCAA website (example).
I've tried numerous approaches without any success. For example:
Since it seems each period's data is nested in a separate element with the class "play-by-play-period" (div.play-by-play-period?), I tried focusing on scraping one period first, then building from there.
So I drilled down to the table containing the 1st period's data ('tbody'), copied the XPath, and pasted it into the code below:
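library(rvest)  # also provides the %>% pipe

url <- "https://www.ncaa.com/game/5935492/play-by-play"
# XPath copied from the browser's inspector for the 1st period's table body
gm <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="gamecenterAppContent"]/div/div[3]/div[2]/table/tbody') %>%
  html_table()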
This results in a "List of 0".
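One possible explanation for the empty list, sketched below under an assumption: if the play-by-play table is built in the browser by JavaScript, the HTML that read_html() receives contains only an empty container, so an XPath copied from the browser's inspector matches nothing. The HTML string here is a hypothetical stand-in, not the actual NCAA response.

```r
library(rvest)

# Hypothetical stand-in for the server's HTML (an assumption): the container
# is present, but the table is only added later by JavaScript in the browser.
shell <- read_html(
  '<html><body><div id="gamecenterAppContent"></div></body></html>'
)

# Same XPath as in the question, copied from the browser's inspector.
nodes <- html_nodes(
  shell,
  xpath = '//*[@id="gamecenterAppContent"]/div/div[3]/div[2]/table/tbody'
)
tables <- html_table(nodes)

length(nodes)   # number of matched nodes
length(tables)  # number of parsed tables
```

If both lengths are zero against the real page as well, the selector is probably not the problem, and the data has to come from whatever source the page loads it from.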
Any help would be greatly appreciated!
Comments (1)
It is easier to call their data source directly with httr2, or to parse it with jsonlite::fromJSON.
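A minimal sketch of the jsonlite route, assuming the page loads its play-by-play data from a JSON feed. The feed URL is not given here, so the example parses an inline sample instead; the sample and its field names (periods, title, plays, time, text) are made up for illustration and must be checked against the real response.

```r
library(jsonlite)

# Hypothetical sample standing in for the feed's response (an assumption).
# The real field names and nesting will differ.
sample_json <- '{
  "periods": [
    {"title": "1st", "plays": [
      {"time": "19:20", "text": "Faceoff won"},
      {"time": "18:05", "text": "Shot on goal"}
    ]},
    {"title": "2nd", "plays": [
      {"time": "17:42", "text": "Penalty"}
    ]}
  ]
}'

# fromJSON() also accepts a URL, e.g. fromJSON(feed_url), where feed_url is
# the JSON request shown in the browser's network tab for the game page.
pbp <- fromJSON(sample_json)

# Each period carries its own data frame of plays; tag each with its period
# and stack them into one table.
per_period <- Map(
  function(title, p) data.frame(period = title, p, stringsAsFactors = FALSE),
  pbp$periods$title,
  pbp$periods$plays
)
plays <- do.call(rbind, per_period)
rownames(plays) <- NULL

plays
```

With httr2, the equivalent fetch would be request(feed_url), then req_perform(), then resp_body_json(), which returns nested lists rather than data frames.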