Struggling to scrape a table with the rvest package

Posted 2025-02-11 01:53:34 · 694 characters · 2 views · 0 comments


I've recently started using R again after a long hiatus, and I'm extremely rusty, especially when it comes to HTML and scraping data (with rvest).

My main issue right now is identifying the correct nodes/XPath to pass to my function so that it pulls the correct data. More specifically, I'm trying to scrape college hockey play-by-play data from the NCAA website (example).

I've tried numerous approaches without any success. For example:

Since each period's data seems to be nested in a separate "div.play-by-play-period" element(?), I tried focusing on scraping one period and then building from there.

So I drilled down to the table containing the 1st period's data ('tbody'), copied the XPath, and pasted it into the code below:

library(rvest)  # provides read_html(), html_nodes(), html_table() and the %>% pipe

url <- "https://www.ncaa.com/game/5935492/play-by-play"

gm <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="gamecenterAppContent"]/div/div[3]/div[2]/table/tbody') %>%
  html_table()

This results in a "List of 0".

Any help would be greatly appreciated!


世界和平 2025-02-18 01:53:34


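The play-by-play table is most likely built in the browser by JavaScript, so it is not in the HTML that read_html() downloads. It is easier to request the JSON source that the page itself loads, using httr2.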

library(tidyverse)
library(httr2)

data <- "https://data.ncaa.com/casablanca/game/5935492/pbp.json" %>%  
  request() %>%  
  req_perform() %>%  
  resp_body_json()

Or with jsonlite::fromJSON():

jsonlite::fromJSON("https://data.ncaa.com/casablanca/game/5935492/pbp.json")

Then parse it into a single tibble:

data <- jsonlite::fromJSON("https://data.ncaa.com/casablanca/game/5935492/pbp.json") %>% 
  .$periods %>% 
  .$playStats %>% 
  bind_rows() %>% 
  as_tibble()

# A tibble: 111 x 4
   score time  visitorText                                                  homeText      
   <chr> <chr> <chr>                                                        <chr>         
 1 ""    20:00 "Jack Watson at goalie for RPI."                             ""            
 2 ""    20:00 ""                                                           "Clay Stevens~
 3 ""    19:39 ""                                                           "Shot by DAR ~
 4 ""    19:36 "Penalty on Johnson, Jake RPI 2 minutes for CrossChecking."  ""            
 5 ""    18:04 ""                                                           "Shot by DAR ~
 6 ""    17:52 ""                                                           "Shot by DAR ~
 7 ""    16:25 ""                                                           "Shot by DAR ~
 8 ""    16:25 "Penalty on Dubinsky, Zach RPI 2 minutes for CrossChecking." ""            
 9 ""    15:55 ""                                                           "Shot by DAR ~
10 ""    14:24 ""                                                           "Shot by DAR ~
# ... with 101 more rows
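
The combined tibble above no longer says which period each play belongs to. A minimal sketch of one way to keep that information, assuming each element of `periods` also carries a label field (called `periodNumber` here, which is an assumed name, not confirmed by the output above). To stay self-contained it parses a small made-up JSON string shaped like the feed rather than calling the live URL; the event texts are placeholders.

```r
library(jsonlite)
library(dplyr)

# Hypothetical miniature of the pbp.json layout: a `periods` array whose
# elements each hold a `playStats` table. `periodNumber` is an assumed
# field name and the event texts are placeholders, not real game data.
sample_json <- '{
  "periods": [
    {"periodNumber": "1",
     "playStats": [
       {"score": "", "time": "20:00", "visitorText": "visitor event A", "homeText": ""},
       {"score": "", "time": "19:39", "visitorText": "", "homeText": "home event B"}
     ]},
    {"periodNumber": "2",
     "playStats": [
       {"score": "", "time": "20:00", "visitorText": "", "homeText": "home event C"}
     ]}
  ]
}'

parsed <- fromJSON(sample_json)

# fromJSON() simplifies `periods` to a data frame whose `playStats` column is
# a list of data frames, one per period. Naming that list by period lets
# bind_rows() record the source period in a new `period` column.
plays <- parsed$periods$playStats %>%
  setNames(parsed$periods$periodNumber) %>%
  bind_rows(.id = "period") %>%
  as_tibble()

print(plays)
```

With the real feed, the same pipeline should apply after replacing `sample_json` with the URL, provided the label field exists under that name; check `names(parsed$periods)` first.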
