How do I get all nested URLs from a webpage using rvest and xml2?
I'm trying to pull all nested links from the webpage below. My code below returns an empty character vector.
library(rvest)  # provides read_html(), html_node(), html_nodes(), html_attr() and %>%

page1 <- "https://thrivemarket.com/c/condiments-sauces?cur_page=1"
page1 <- read_html(page1)

page1_body <- page1 %>%
  html_node("body") %>%
  html_children()

page1_urls <- page1 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//div[contains(@class, 'd85qmy-0 kRbsKs')]") %>%
  rvest::html_attr('href')
Thank you in advance for your help with this.
Best,
~Mayra
Comments (1)
The links you are looking for do not exist in the HTML document you are reading with read_html. When you look at the page in a browser, the HTML document contains JavaScript code, which your browser runs. Some of this JavaScript causes your browser to download further information, which is then inserted into the web page you see. In your case, the extra information you are looking for comes as a JSON file, which you can obtain and parse as follows:
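A minimal sketch of that approach, under stated assumptions: the JSON below is a made-up stand-in, and its field names (products, name, url) and the product paths are hypothetical, not taken from the site. It only illustrates parsing such a payload with jsonlite and building absolute links from it.

```r
library(jsonlite)

# Hypothetical stand-in for the JSON the page's JavaScript downloads.
# The field names ("products", "name", "url") are assumptions for illustration.
sample_json <- '{
  "products": [
    {"name": "Example Ketchup", "url": "/p/example-ketchup"},
    {"name": "Example Mustard", "url": "/p/example-mustard"}
  ]
}'

# fromJSON() accepts a JSON string, a file path, or a URL, so in practice the
# real endpoint URL would take the place of sample_json here.
parsed <- fromJSON(sample_json)

# An array of objects is simplified to a data frame, one row per product.
products <- parsed$products

# Turn the relative paths into absolute URLs.
base_url <- "https://thrivemarket.com"
product_urls <- paste0(base_url, products$url)

print(product_urls)
```

The actual endpoint and response structure have to be read from the browser's developer tools (Network tab) while the page loads; the column used to build the links would need to be adjusted to whatever the real payload contains.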
Created on 2022-06-04 by the reprex package (v2.0.1)