Web scraping data not displayed on the web page using rvest
This is a follow-up to a previous question: Scraping data using R and placing results in a data frame
I'm trying to scrape reviews from Glassdoor, including the sub-ratings (work-life balance, culture and values, etc.). The sub-ratings are in a drop-down menu and are displayed as a number of stars (1-5). Dave2e posted a very helpful solution to my previous question, but I've found that some companies' review pages are formatted differently, so that solution doesn't work for them. An example of a company where it doesn't work is below.
library(stringr)
library(httr)
library(xml2)
library(rvest)
library(purrr)
library(tidyverse)
library(lubridate)
Subratings <- data.frame()
url <- "https://www.glassdoor.com/Reviews/Fresenius-Medical-Care-North-America-Reviews-"
settings_url <- ".htm?filter.iso3Language=eng"
for (x in 1:3) {
  pg_reviews <- read_html(GET(paste(url, "E10445", "_P", x, settings_url, sep = "")))
  #the ratings are stored in a data structure in a script
  #find all the scripts and then search
  scripts <- pg_reviews %>% html_elements(xpath='//script')
  #search the scripts for the ratings
  ratingsScript <- which(grepl("ratingCareerOpportunities", scripts))
  #filter the script down to just the data. This is JSON like haven't figured out the beginning or end
  data1 <- scripts[ratingsScript] %>% html_text2() %>% str_extract("\"urlParams\":.+\\}\\}\\}\\}")
  #extract the ratings
  WorkLifeBalance <- str_extract_all(data1, '(?<="ratingWorkLifeBalance":)\\d') %>% unlist() %>% as.integer()
  CultureAndValues <- str_extract_all(data1, '(?<="ratingCultureAndValues":)\\d') %>% unlist() %>% as.integer()
  DiversityAndInclusion <- str_extract_all(data1, '(?<="ratingDiversityAndInclusion":)\\d') %>% unlist() %>% as.integer()
  SeniorLeadership <- str_extract_all(data1, '(?<="ratingSeniorLeadership":)\\d') %>% unlist() %>% as.integer()
  CareerOpportunities <- str_extract_all(data1, '(?<="ratingCareerOpportunities":)\\d') %>% unlist() %>% as.integer()
  CompensationAndBenefits <- str_extract_all(data1, '(?<="ratingCompensationAndBenefits":)\\d') %>% unlist() %>% as.integer()
  #Combine columns
  combine <- cbind(WorkLifeBalance, CultureAndValues, DiversityAndInclusion, SeniorLeadership,
                   CareerOpportunities, CompensationAndBenefits)
  Subratings <- rbind(Subratings, combine)
}
Comments (1)
It looks like this page has one fewer closing brace, so try:
str_extract("\"urlParams\":.+\\}\\}\\}")
This should also work on the previous pages.
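To illustrate why the shorter pattern covers both layouts, here is a minimal sketch using two made-up script fragments (the strings `four_braces` and `three_braces` are hypothetical stand-ins, not real Glassdoor output). Because `.+` is greedy, the three-brace pattern still matches text that ends in four braces, while the four-brace pattern returns `NA` on the shorter layout.

```r
library(stringr)

# Hypothetical miniature script bodies, one per page layout (not real page data)
four_braces  <- 'x={"urlParams":{"a":{"b":{"ratingWorkLifeBalance":4}}}};'
three_braces <- 'x={"urlParams":{"a":{"ratingWorkLifeBalance":3}}};'

strict <- "\"urlParams\":.+\\}\\}\\}\\}"  # original pattern, four closing braces
loose  <- "\"urlParams\":.+\\}\\}\\}"     # suggested pattern, three closing braces

# The strict pattern finds nothing on the three-brace layout
strict_on_three <- str_extract(three_braces, strict)

# The loose pattern matches both layouts
loose_on_three <- str_extract(three_braces, loose)
loose_on_four  <- str_extract(four_braces, loose)

# The rating lookbehind from the question then works on either result
rating_three <- as.integer(str_extract(loose_on_three, '(?<="ratingWorkLifeBalance":)\\d'))
rating_four  <- as.integer(str_extract(loose_on_four, '(?<="ratingWorkLifeBalance":)\\d'))
```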
After much searching, I found that the employee reviews are stored in the string starting with "reviews": and ending with }]}.
Adding a leading { to that substring turns the reviews into valid JSON, so the conversion becomes simple.
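The JSON approach can be sketched roughly as follows. This is only an illustration under stated assumptions: `script_text` is a hypothetical, heavily shortened stand-in for the real script body, the use of `jsonlite::fromJSON` is my choice rather than something the answer names, and the lazy `.+?` assumes the first `}]}` after "reviews": is the one that closes the reviews array, which may not hold if reviews contain nested arrays of objects.

```r
library(stringr)
library(jsonlite)

# Hypothetical, shortened stand-in for the script text found on the page
script_text <- '{"urlParams":{"page":1},"reviews":[{"ratingWorkLifeBalance":4,"ratingCultureAndValues":5},{"ratingWorkLifeBalance":2,"ratingCultureAndValues":3}]},"other":{"x":1}}'

# Take the substring starting at "reviews": and ending at }]}
# (lazy match: assumes the first }]} closes the reviews array)
reviews_chunk <- str_extract(script_text, '"reviews":.+?\\}\\]\\}')

# Prepend { so the chunk becomes a complete JSON object, then parse it
reviews <- fromJSON(paste0("{", reviews_chunk))$reviews

# reviews is now a data frame with one row per review
WorkLifeBalance  <- reviews$ratingWorkLifeBalance
CultureAndValues <- reviews$ratingCultureAndValues
```

Parsing the JSON keeps each review's fields aligned in one row, which avoids the column-length mismatches that can arise when each rating is pulled out by a separate regular expression.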