Scraping a list of websites simultaneously with rvest

Posted on 2025-01-10 16:24:07


I am trying to scrape multiple product catalogues, where each link points to a different product.

webpages is a data frame containing the links.

webpages
"https............"
"https............"
"https............"

I have the following code:

for (i in webpages){
    book_page <- read_html(link) 
}

I got this error: Error: x must be a string of length 1.

May I know how I could resolve it?


时光磨忆 2025-01-17 16:24:07


A for loop does not download multiple websites at the same time, as the title of your question requires. However, you can use a parallelization package, e.g. pbmcapply:

library(rvest)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:rvest':
#> 
#>     guess_encoding
library(pbmcapply)
#> Loading required package: parallel

webpages <- list(
  "http://example.com",
  "https://stackoverflow.com/",
  "https://github.com/"
)

# download 3 webpages at the same time
contents <- pbmclapply(webpages, read_file, mc.cores = 3)
contents_html <- lapply(contents, read_html)
contents_html[[1]]
#> {html_document}
#> <html>
#> [1] <head>\n<title>Example Domain</title>\n<meta charset="utf-8">\n<meta http ...
#> [2] <body>\n<div>\n    <h1>Example Domain</h1>\n    <p>This domain is for use ...

Created on 2022-03-01 by the reprex package (v2.0.1)

read_html must be executed in the main thread to circumvent pointer errors.
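As a side note, the error "x must be a string of length 1" usually means read_html() was handed a whole character vector rather than a single URL; the original loop also references `link` rather than the loop variable `i`, and looping over a data frame iterates over its columns, not its rows. A minimal sequential sketch of a corrected loop (the column name `url` is an assumption, since the structure of your `webpages` data frame isn't shown):

```r
library(rvest)

# Hypothetical stand-in for the asker's data frame of links;
# the column name `url` is an assumption.
webpages <- data.frame(url = c(
  "http://example.com",
  "https://www.r-project.org/"
))

book_pages <- list()
for (link in webpages$url) {
  # read_html() expects a single URL string, so iterate over the
  # URL column and pass the loop variable, one link at a time
  book_pages[[link]] <- read_html(link)
}
```

This downloads the pages one after another; the pbmclapply() approach above is still the way to fetch several of them concurrently.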
