Web scraping multiple pages in R: rbind fails

Posted on 2025-02-12 19:17:41

I am web scraping this website and run into trouble when I try to rbind all the columns into one dataset. It gives me an error saying that the columns have different numbers of rows, for example 25 elements in price and 24 in description.

{if(length(.) == 0) NA else .}

I tried adding the snippet above to insert NA when the scraper doesn't find a value, but it doesn't seem to work. The full code is below.

urls <- sprintf("https://www.immobiliare.it/vendita-case/milano/?pag=%d", 1:7)

case <- data.frame() 

for (i in urls){
  page <- read_html(i)
  
  Scrape <- page  %>% html_nodes(xpath= "//ul[@class='nd-list in-realEstateResults']") %>% 
    purrr::map_df(~list(description= html_nodes(.x, xpath= "//a[@class='in-card__title']") %>% html_text() %>% length() %>% {if(length(.) == 0) NA else .}, #Returns NA for missing data
                        
                        price= html_nodes(.x, xpath= "//li[@class='nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .},
                        
                        rooms = html_nodes(.x,xpath= "//li[@aria-label='locali']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .},
                        
                        area= html_nodes(.x,xpath= "//li[@aria-label='superficie']") %>% html_text(trim = TRUE) %>% {if(length(.) == 0) NA else .}))
  
  
  temp <- data.frame(Scrape)
  case <- rbind(temp, case)
  
  print(paste("Page:",i))
}

Any suggestions?
Let me know if you have any questions.

ぃ弥猫深巷。 2025-02-19 19:17:41

Please review the question I referenced in my comment.
The problem you are facing is that not every listing node contains all of the information you are looking for, which produces the unequal-length error.
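
To illustrate why the guard from the question does not help, here is a minimal base R sketch. The vectors are made-up stand-ins for one scraped page (hypothetical values, three listings instead of 25), and `guard` is just an illustrative name for the `{if(length(.) == 0) NA else .}` expression.

```r
# Hypothetical values standing in for one scraped page with three listings,
# where the second listing has no description.
price       <- c("€ 900.000", "€ 275.000", "€ 210.000")
description <- c("Trilocale via Orti 2", "Bilocale piazza Monte Falterona 5")

# The guard from the question, written as a function (illustrative name).
guard <- function(x) if (length(x) == 0) NA else x

# A vector that is merely too short passes through unchanged.
guarded_description <- guard(description)
length(guarded_description)

# Only a completely empty result is replaced by NA.
guard(character(0))

# Building the data frame still fails because the columns differ in length.
result <- tryCatch(
  data.frame(price, description = guarded_description),
  error = function(e) conditionMessage(e)
)
result
```

The guard can only act on the whole vector, so it cannot know which listing lost its value. That is why the extraction has to happen per listing.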

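The best way to handle these situations is to collect all of the parent nodes in a list/vector and then extract the desired information from each parent using the html_node() function (without the "s"). html_node() always returns exactly 1 result for every parent node, even if that result is NA. html_nodes() returns only the matches it finds, so a parent with no match contributes nothing and the resulting vectors end up with different lengths.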
See comments for more information.
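
The difference between the two functions can be seen on a tiny page. This is only a sketch: the markup, the `card` and `title` class names and the variable names are invented for illustration and are not the real immobiliare.it structure.

```r
library(rvest)

# Hypothetical markup: two listing cards, the second has no "locali" item.
page <- minimal_html('
  <div class="card"><a class="title">Flat A</a><ul><li aria-label="locali">3</li></ul></div>
  <div class="card"><a class="title">Flat B</a><ul></ul></div>
')

# One parent node per listing.
cards <- html_nodes(page, xpath = "//div[@class='card']")

# html_nodes() keeps only the matches it finds, so the missing value vanishes.
rooms_all <- cards %>%
  html_nodes(xpath = ".//li[@aria-label='locali']") %>%
  html_text(trim = TRUE)

# html_node() returns one result per card, NA where nothing matches.
rooms_each <- cards %>%
  html_node(xpath = ".//li[@aria-label='locali']") %>%
  html_text(trim = TRUE)
titles <- cards %>%
  html_node(xpath = ".//a[@class='title']") %>%
  html_text(trim = TRUE)

length(rooms_all)
rooms_each

# Columns now line up, so the data frame can be built.
data.frame(title = titles, rooms = rooms_each)
```

Because every column is derived from the same `cards` vector, each column has one entry per listing and `data.frame()` no longer complains.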

library(rvest)
library(dplyr)

df_apartments <- list()
for (i in 1:7) {
  # read the page
  page <- read_html(paste0("https://www.immobiliare.it/vendita-case/milano/?pag=", i))

  # read the parent nodes (one per listing)
  apartments <- page %>% html_nodes(xpath = "//div[@class='nd-mediaObject__content in-card__content in-realEstateListCard__content']")

  # parse information from each of the parent nodes;
  # html_node() keeps one entry per listing and gives NA when the element is missing
  price <- apartments %>% html_node(xpath = ".//li[@class='nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main']") %>% html_text(trim = TRUE)
  rooms <- apartments %>% html_node(xpath = ".//li[@aria-label='locali']") %>% html_text(trim = TRUE)
  area <- apartments %>% html_node(xpath = ".//li[@aria-label='superficie']") %>% html_text(trim = TRUE)
  description <- apartments %>% html_node(xpath = ".//a[@class='in-card__title']") %>% html_text()

  # put the data together into a data frame and add it to the list
  df_apartments[[i]] <- data.frame(price, rooms, area, description)
}
# combine all data frames into one
answer <- bind_rows(df_apartments)


head(answer, 7)
         price rooms  area                                                            description
1    € 900.000     3 125m²                     Trilocale via Orti 2, Quadronno - Crocetta, Milano
2    € 275.000     2  50m²                Bilocale via Gian Francesco Pizzi 34, Ripamonti, Milano
3    € 275.000     2  55m²                 Bilocale viale dei Mille 14, Plebisciti - Susa, Milano
4    € 799.000     4 135m²                 Quadrilocale via Beato Angelico 3, Città Studi, Milano
5    € 210.000     2  65m²                    Bilocale piazza Monte Falterona 5, San Siro, Milano
6    € 240.000     2  50m²                Bilocale via dell'Assunta 5, Vigentino - Fatima, Milano
7    € 395.000     3  90m² Trilocale via Cuore Immacolato di Maria 12, Vigentino - Fatima, Milano

Update
Because we are using the xpath option, we need to add a "." before the "//" to tell the XPath parser to start at the current node and not at the first node of the document.
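
A small sketch of the effect of the leading ".", again on invented markup (the `card` and `title` class names are hypothetical):

```r
library(rvest)

# Hypothetical markup with two listing cards.
page <- minimal_html('
  <div class="card"><a class="title">Flat A</a></div>
  <div class="card"><a class="title">Flat B</a></div>
')
cards <- html_nodes(page, xpath = "//div[@class='card']")

# "//a" searches from the document root, so every card yields the
# first matching link in the whole page.
absolute <- cards %>% html_node(xpath = "//a[@class='title']") %>% html_text()

# ".//a" searches only beneath the current card.
relative <- cards %>% html_node(xpath = ".//a[@class='title']") %>% html_text()

absolute
relative
```

Without the ".", both cards should report the first title in the page, which would silently repeat the same value in every row.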
