How to include user_agent in my scraping loop

Posted 2025-01-13 12:52:55

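I'm trying to scrape prices for multiple products from the same seller, but I couldn't read the HTML through R (error 403). After some research, I found that you can get around this problem by setting a user agent with the httr package.

But now that I want to scrape multiple product pages in a loop, I'm not sure how to integrate the GET function and user_agent into my loop. So far my code looks like this: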

for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( paste0('https://www.deindeal.ch/de/product/',j)%>%
                          read_html %>%
                          html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
                          html_text()%>%
                          str_extract("[0-9]+") %>%
                          as.integer())
}

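(The correct html_element and html_text are also not set yet; that will probably be a further problem.)
j refers to the article IDs of the products in the webshop, e.g. 16030981 and 16030983. So the links look like this: https://www.deindeal.ch/de/product/16030981 and https://www.deindeal.ch/de/product/16030983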

Edit: So far I have tried this, but without success:
(Error message: Error in parse_url(url) : length(url) == 1 is not TRUE)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( GET( paste0('https://www.deindeal.ch/de/product/',j,user_agent(ua)))%>%
                          read_html %>%
                          html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
                          html_text()%>%
                          str_extract("[0-9]+") %>%
                          as.integer())
}

Comments (1)

半枫 2025-01-20 12:52:55

Your closing ) is in the wrong place.

You have user_agent(ua) inside paste0(), but it should be outside paste0(), as the second argument to GET().

Here is the corrected call, with extra spaces to make the change visible:

GET(  paste0('https://www.deindeal.ch/de/product/',j),  user_agent(ua)  )

or

url <- paste0('https://www.deindeal.ch/de/product/',j) 

GET(  url,  user_agent(ua)  )

or

paste0('https://www.deindeal.ch/de/product/',j) %>% GET( user_agent(ua) )
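To see where the error message length(url) == 1 is not TRUE comes from, it can help to look at what paste0() produces in each case. The snippet below is a small offline sketch, not part of the original answer; the short ua string and the single ID are placeholders chosen for illustration. It relies on user_agent() returning an httr request object (a list) rather than a string, so pasting it into the URL yields several strings instead of one.

```r
library(httr)

ua <- "Mozilla/5.0"   # placeholder user agent string
j  <- 16030981        # one example article ID from the question

# user_agent() returns a request object (a list), not a character string.
# paste0() converts each list element to text, so the result has more
# than one element, and GET() rejects it because it expects one URL.
bad_url <- paste0("https://www.deindeal.ch/de/product/", j, user_agent(ua))
length(bad_url) > 1

# Closing paste0() before user_agent() keeps the URL a single string;
# user_agent(ua) can then be passed to GET() as a separate argument.
good_url <- paste0("https://www.deindeal.ch/de/product/", j)
length(good_url) == 1
```

No request is sent here; the snippet only compares the two URL values.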

I used the page https://httpbin.org/get to test it.

Without user_agent(), it shows "libcurl/7.68.0 r-curl/4.3.2 httr/1.4.2":

> GET( paste0('https://httpbin.org/', 'get') )

Response [https://httpbin.org/get]
  Date: 2022-03-10 17:54
  Status: 200
  Content-Type: application/json
  Size: 373 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip, br", 
    "Host": "httpbin.org", 
    "User-Agent": "libcurl/7.68.0 r-curl/4.3.2 httr/1.4.2", 
    "X-Amzn-Trace-Id": "Root=1-622a3b72-00b48f9c2c15da155db2e723"
  }, 
  "origin": "79.163.206.131", 
...

With user_agent(), it shows "Mozilla/5.0":

> GET( paste0('https://httpbin.org/', 'get'), user_agent('Mozilla/5.0') )

Response [https://httpbin.org/get]
  Date: 2022-03-10 17:52
  Status: 200
  Content-Type: application/json
  Size: 346 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip, br", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0", 
    "X-Amzn-Trace-Id": "Root=1-622a3ac6-41ace39e5aa032da3e312de9"
  }, 
  "origin": "79.163.206.131", 
...
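
Putting the corrected call back into the loop from the question might look like the sketch below. The helper names extract_price, scrape_price and scrape_all are hypothetical and introduced only for illustration. The selector is the one from the question, which the asker says is not confirmed yet, so it is passed in as a parameter. The sketch also assumes that setting the user agent is enough to avoid the 403; if the site still refuses the request, scrape_price returns NA for that product.

```r
library(httr)
library(rvest)    # also provides %>%
library(stringr)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# Hypothetical helper: pull the first run of digits out of the element
# matched by `selector`. Returns NA when nothing matches.
extract_price <- function(page, selector) {
  page %>%
    html_element(selector) %>%
    html_text() %>%
    str_extract("[0-9]+") %>%
    as.integer()
}

# Hypothetical helper: request one product page with the user agent set.
scrape_price <- function(id, selector) {
  url  <- paste0("https://www.deindeal.ch/de/product/", id)
  resp <- GET(url, user_agent(ua))          # user_agent() is GET()'s second argument
  if (http_error(resp)) return(NA_integer_) # e.g. the site still answers 403
  extract_price(read_html(resp), selector)
}

# Hypothetical helper: the loop from the question, indexed with seq_along()
# so the counter i does not have to be incremented by hand.
scrape_all <- function(ids, selector) {
  prices <- rep(NA_integer_, length(ids))
  for (i in seq_along(ids)) {
    Sys.sleep(runif(1, min = 0.25, max = 0.5))
    prices[i] <- tryCatch(scrape_price(ids[i], selector),
                          error = function(e) NA_integer_)
  }
  prices
}
```

With the data from the question, the call would be something like vec_deindeal <- scrape_all(input_deindeal$`Deindeal Artikel`, '#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]'). Note that the regular expression keeps only the first run of digits, so a price such as 49.90 would come back as 49.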