How can I get all nested URLs from a webpage using rvest and xml2?

Posted on 2025-02-04 06:43:55

I'm trying to pull all nested links from the webpage below. My code below returns an empty character vector.

page1 <- "https://thrivemarket.com/c/condiments-sauces?cur_page=1"
page1 <- read_html(page1)
page1_body <- page1 %>% 
  html_node("body") %>% 
  html_children()

page1_urls <- page1 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]") %>%
  rvest::html_attr('href')

Thank you in advance for your help with this.

Best,
~Mayra

悸初 2025-02-11 06:43:55

The links you are looking for do not exist in the HTML document you are reading with read_html. When you view the page in a browser, the HTML document contains JavaScript code, which the browser runs. Some of that JavaScript makes the browser download further data and insert it into the page you see.
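To see the difference without touching the network, here is a minimal sketch that uses two made-up HTML strings: one standing in for what the server sends, and one for what the browser shows after its JavaScript has run. Both strings, the example hrefs, and the helper name nested_links are assumptions for illustration, not the real page content. The sketch also selects the a elements inside the matching div, since href is an attribute of a, not of div.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the HTML the server sends: the container is empty
# and would only be filled in later by JavaScript running in a browser.
static_html <- '<html><body>
  <div id="root"></div>
  <script>/* JavaScript that fetches products and renders links */</script>
</body></html>'

# Hypothetical stand-in for the DOM after the browser has run that JavaScript.
rendered_html <- '<html><body>
  <div id="root">
    <div class="d85qmy-0 kRbsKs"><a href="/p/example-product-1">One</a></div>
    <div class="d85qmy-0 kRbsKs"><a href="/p/example-product-2">Two</a></div>
  </div>
</body></html>'

# Return the href of every <a> nested inside the matching <div> elements.
nested_links <- function(html) {
  read_html(html) %>%
    xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]//a") %>%
    html_attr("href")
}

static_links <- nested_links(static_html)      # no matching nodes, so empty
rendered_links <- nested_links(rendered_html)  # the two example hrefs
```

read_html only ever sees something like the first string, which is why the XPath query in the question comes back empty.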

In your case, the extra information you are looking for comes as a JSON file, which you can obtain and parse as follows:

library(httr)
library(dplyr)

# JSON endpoint that supplies the product data
url <- paste0("https://thrivemarket.com/api/v1/products",
              "?page_size=60&multifilter=1&cur_page=1")

# Fetch and parse the JSON, then build one row (title, URL) per product
content(GET(url))$products %>%
  lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
  bind_rows() %>%
  as_tibble()
#> # A tibble: 60 x 2
#>    product                                          url                         
#>    <chr>                                            <chr>                       
#>  1 Organic Extra Virgin Olive Oil                   https://thrivemarket.com/p/~
#>  2 Grass-Fed Collagen Peptides                      https://thrivemarket.com/p/~
#>  3 Grass-Fed Beef Sticks, Original                  https://thrivemarket.com/p/~
#>  4 Organic Dry Roasted & Salted Cashews             https://thrivemarket.com/p/~
#>  5 Organic Vanilla Extract                          https://thrivemarket.com/p/~
#>  6 Organic Raw Cashews                              https://thrivemarket.com/p/~
#>  7 Organic Coconut Milk, Regular                    https://thrivemarket.com/p/~
#>  8 Organic Robust Maple Syrup, Grade A, Value Size  https://thrivemarket.com/p/~
#>  9 Organic Coconut Water                            https://thrivemarket.com/p/~
#> 10 Non-GMO Avocado Oil Potato Chips, Himalayan Salt https://thrivemarket.com/p/~
#> # ... with 50 more rows

Created on 2022-06-04 by the reprex package (v2.0.1)
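Since the question asks for all links and the request above covers only cur_page=1, the same approach could be wrapped in two small helpers. This is a sketch under assumptions: the helper names products_url and products_to_tibble are hypothetical, it is assumed (not verified) that increasing cur_page walks through further pages, and the example_products list is invented data standing in for content(GET(url))$products.

```r
library(dplyr)

# Build the API URL for one page. The parameter names are copied from the URL
# above; that cur_page selects the page is an assumption.
products_url <- function(page, page_size = 60) {
  paste0("https://thrivemarket.com/api/v1/products",
         "?page_size=", page_size, "&multifilter=1&cur_page=", page)
}

# Turn a parsed `products` list into a tibble of titles and URLs,
# using the same steps as the code above.
products_to_tibble <- function(products) {
  products %>%
    lapply(function(x) data.frame(product = x$title, url = x$url)) %>%
    bind_rows() %>%
    as_tibble()
}

# Hypothetical parsed response, standing in for content(GET(url))$products.
example_products <- list(
  list(title = "Example product A", url = "https://example.com/p/a"),
  list(title = "Example product B", url = "https://example.com/p/b")
)

example_tbl <- products_to_tibble(example_products)

# URLs for the first three pages, ready to be requested one by one.
page_urls <- vapply(1:3, products_url, character(1))
```

With httr loaded, each element of page_urls could then be passed to GET, parsed with content, converted with products_to_tibble, and the per-page results combined with bind_rows.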
