Scraping a list of websites simultaneously with rvest

Posted on 2025-01-10 16:24:07


I am trying to scrape multiple product catalogues, where each link points to a different product.

webpages is a data frame containing the links.

webpages
"https............"
"https............"
"https............"

I have the following code:

for (i in webpages){
    book_page <- read_html(link) 
}

I got this error: Error: x must be a string of length 1.

May I know how I could resolve it?


时光磨忆 2025-01-17 16:24:07


A for loop does not download multiple websites at the same time, as the title of your question requires. However, you can use a parallelization package, e.g. pbmcapply:

library(rvest)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(pbmcapply)
#> Loading required package: parallel

webpages <- list(
  "http://example.com",
  "https://stackoverflow.com/",
  "https://github.com/"
)

# download 3 webpages at the same time
contents <- pbmclapply(webpages, read_file, mc.cores = 3)
contents_html <- lapply(contents, read_html)
contents_html[[1]]
#> {html_document}
#> <html>
#> [1] <head>\n<title>Example Domain</title>\n<meta charset="utf-8">\n<meta http ...
#> [2] <body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use ...

Created on 2022-03-01 by the reprex package (v2.0.1)

read_html must be executed in the main thread to circumvent pointer errors.
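As a side note, the error "x must be a string of length 1" usually means read_html() was handed a whole character vector rather than a single URL; the original loop also references `link` rather than the loop variable `i`, and looping over a data frame iterates over its columns, not its rows. A minimal sequential sketch of a corrected loop (the column name `url` is an assumption, since the structure of your `webpages` data frame isn't shown):

```r
library(rvest)

# Hypothetical stand-in for the asker's data frame of links;
# the column name `url` is an assumption.
webpages <- data.frame(url = c(
  "http://example.com",
  "https://www.r-project.org/"
))

book_pages <- list()
for (link in webpages$url) {
  # read_html() expects a single URL string, so iterate over the
  # URL column and pass the loop variable, one link at a time
  book_pages[[link]] <- read_html(link)
}
```

This downloads the pages one after another; the pbmclapply() approach above is still the way to fetch several of them concurrently.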
