Web scraping data that is not displayed on a webpage using rvest

Posted 2025-02-12 04:19:26

This is a follow-up to a previous question: "Scraping data using R and placing results in a data frame".

I'm trying to scrape reviews from Glassdoor, including the sub-ratings (work-life balance, culture and values, etc.). The sub-ratings are in a drop-down menu and are displayed as a number of stars (1-5). Dave2e posted a very helpful solution to my previous question, but I've found that some companies' review pages are formatted differently, so that solution doesn't work for them. An example of a company where it doesn't work is below.

library(stringr)
library(httr)  
library(xml2)  
library(rvest) 
library(purrr) 
library(tidyverse)
library(lubridate)

Subratings <- data.frame()
url <- "https://www.glassdoor.com/Reviews/Fresenius-Medical-Care-North-America-Reviews-"
settings_url <- ".htm?filter.iso3Language=eng"

for (x in 1:3) {
  pg_reviews <- read_html(GET(paste(url, "E10445", "_P", x, settings_url, sep = "")))

  # the ratings are stored in a data structure in a script
  # find all the scripts and then search
  scripts <- pg_reviews %>% html_elements(xpath = '//script')

  # search the scripts for the ratings
  ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
  # filter the script down to just the data; this is JSON-like, but I haven't figured out where it begins or ends
  data1 <- scripts[ratingsScript] %>% html_text2() %>% str_extract("\"urlParams\":.+\\}\\}\\}\\}")

  # extract the ratings
  WorkLifeBalance         <- str_extract_all(data1, '(?<="ratingWorkLifeBalance":)\\d') %>% unlist() %>% as.integer()
  CultureAndValues        <- str_extract_all(data1, '(?<="ratingCultureAndValues":)\\d') %>% unlist() %>% as.integer()
  DiversityAndInclusion   <- str_extract_all(data1, '(?<="ratingDiversityAndInclusion":)\\d') %>% unlist() %>% as.integer()
  SeniorLeadership        <- str_extract_all(data1, '(?<="ratingSeniorLeadership":)\\d') %>% unlist() %>% as.integer()
  CareerOpportunities     <- str_extract_all(data1, '(?<="ratingCareerOpportunities":)\\d') %>% unlist() %>% as.integer()
  CompensationAndBenefits <- str_extract_all(data1, '(?<="ratingCompensationAndBenefits":)\\d') %>% unlist() %>% as.integer()

  # combine the columns
  combine <- cbind(WorkLifeBalance, CultureAndValues, DiversityAndInclusion, SeniorLeadership,
                   CareerOpportunities, CompensationAndBenefits)

  Subratings <- rbind(Subratings, combine)
}

拥醉 2025-02-19 04:19:26

It looks like this page has one fewer closing brace, so try: str_extract("\"urlParams\":.+\\}\\}\\}").
This should also work on the earlier pages, because the greedy .+ simply absorbs the extra brace.
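
To see why the shorter pattern covers both page layouts, here is a minimal sketch on two made-up script fragments (the strings and the names `four_braces`, `three_braces`, `strict`, and `loose` are invented for illustration and are not Glassdoor's real payload). It only assumes that the greedy `.+` behaves as in any standard regex engine.

```r
library(stringr)

# Made-up fragments: one ends in four closing braces, the other in three.
four_braces  <- 'x={"urlParams":{"a":{"b":{"ratingWorkLifeBalance":4}}}};'
three_braces <- 'x={"urlParams":{"a":{"ratingWorkLifeBalance":4}}};'

strict <- "\"urlParams\":.+\\}\\}\\}\\}"  # requires four closing braces
loose  <- "\"urlParams\":.+\\}\\}\\}"     # requires only three

# The strict pattern finds nothing on the three-brace page (returns NA).
strict_on_three <- str_extract(three_braces, strict)

# The loose pattern matches both layouts; on the four-brace page the
# greedy .+ swallows the extra brace, so the match is the same as before.
loose_on_three <- str_extract(three_braces, loose)
loose_on_four  <- str_extract(four_braces, loose)
strict_on_four <- str_extract(four_braces, strict)
```

Comparing `loose_on_four` with `strict_on_four` shows that relaxing the pattern does not change what is extracted from the pages that already worked.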

After much searching, I found that the employee reviews are stored in the string that starts with "reviews": and ends with }]}.
Adding a leading { turns this string into valid JSON, so it can be converted directly.
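
The idea can be tried offline on a small synthetic string. In the sketch below, `script_text` is an invented stand-in for the script contents (the real page embeds far more fields), and `reviewId` is a hypothetical field name; only the "reviews": ... }]} shape is taken from the description above.

```r
library(stringr)

# Invented stand-in for the text of the script that holds the reviews.
script_text <- paste0(
  'window.appCache={"x":{"reviews":[',
  '{"reviewId":1,"ratingWorkLifeBalance":4,"ratingCareerOpportunities":3},',
  '{"reviewId":2,"ratingWorkLifeBalance":5,"ratingCareerOpportunities":2}',
  ']},"other":{"k":1}};'
)

# Lazy match: stop at the first }]} after "reviews":
reviews_txt <- str_extract(script_text, "\"reviews\":.+?\\}\\]\\}")

# Prepend { so the fragment becomes a complete JSON object, then parse it.
parsed <- jsonlite::fromJSON(paste0("{", reviews_txt))

# parsed$reviews is a data frame with one row per review.
reviews_df <- parsed$reviews
```

Because the match is lazy, it ends at the first }]} it meets, so this approach relies on that sequence not appearing earlier inside an individual review.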

library(stringr) 
library(httr)
library(xml2)
library(rvest) 
library(dplyr)

url <- "https://www.glassdoor.com/Reviews/Fresenius-Medical-Care-North-America-Reviews-"
settings_url <- ".htm?filter.iso3Language=eng"

dfs <- lapply(1:3, function(x) {
   pg_reviews <- read_html(GET(paste(url, "E10445", "_P", x, settings_url, sep = "")))

   # the ratings are stored in a data structure in a script
   # find all the scripts and then search
   scripts <- pg_reviews %>% html_elements(xpath = '//script')

   # search the scripts for the ratings
   ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))

   # extract the text for the reviews from the script; this is almost valid JSON
   reviews <- scripts[ratingsScript] %>% html_text2() %>%
      str_extract("\"reviews\":.+?\\}\\]\\}")
   # char <- nchar(reviews)  # debugging status

   # add a leading { to make valid JSON and convert
   answer <- jsonlite::fromJSON(paste("{", reviews))
   answer
})

bind_rows(dfs)
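
Once the reviews are in a data frame, the sub-ratings asked about in the question can be pulled out by column name. The sketch below uses an invented `reviews_df` and assumes a flat data frame with one row per review; depending on how jsonlite simplifies the real JSON, the reviews may instead sit in a nested `reviews` column that has to be unpacked first. The `rating...` column names come from the patterns in the question, while `reviewId` and `pros` are hypothetical.

```r
library(dplyr)

# Invented stand-in for the combined reviews; real columns will differ.
reviews_df <- data.frame(
  reviewId = c(101, 102, 103),
  ratingWorkLifeBalance = c(4, 5, 3),
  ratingCultureAndValues = c(3, 5, 2),
  ratingCareerOpportunities = c(2, 4, 1),
  pros = c("a", "b", "c")
)

# Keep the id plus every sub-rating column, then drop the "rating" prefix
# so the names match the ones used in the question.
subratings <- reviews_df %>%
  select(reviewId, starts_with("rating")) %>%
  rename_with(~ sub("^rating", "", .x), starts_with("rating"))
```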
