How to include user_agent in my scraping loop

Posted on 2025-01-13 12:52:55 · 1845 characters · 6 views · 0 comments

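I'm trying to scrape prices for multiple products from the same seller, but I wasn't able to read the HTML through R (error 403). After some research I found that you can get around this problem by setting a user agent with the httr package.

But now that I want to scrape multiple product pages in a loop, I'm not sure how to integrate the GET function and user_agent into my loop. So far my code looks like this:

for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( paste0('https://www.deindeal.ch/de/product/',j)%>%
                          read_html %>%
                          html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
                          html_text()%>%
                          str_extract("[0-9]+") %>%
                          as.integer())
}

(The correct html_element and html_text are also not set yet; that will probably be a further problem.)

j refers to the article IDs of the products in the webshop, e.g. 16030981 and 16030983. So the links look like this: https://www.deindeal.ch/de/product/16030981 and https://www.deindeal.ch/de/product/16030983

Edit: So far I have tried the following, without success (error message: Error in parse_url(url) : length(url) == 1 is not TRUE):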

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( GET( paste0('https://www.deindeal.ch/de/product/',j,user_agent(ua)))%>%
                          read_html %>%
                          html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
                          html_text()%>%
                          str_extract("[0-9]+") %>%
                          as.integer())
}



半枫 2025-01-20 12:52:55

Your ) is in the wrong place.

You have user_agent(ua) inside paste0(), but it should be outside paste0(), as the second argument to GET().

I added spaces to make it visible:

GET(  paste0('https://www.deindeal.ch/de/product/',j),  user_agent(ua)  )

or

url <- paste0('https://www.deindeal.ch/de/product/',j) 

GET(  url,  user_agent(ua)  )

or

paste0('https://www.deindeal.ch/de/product/',j) %>% GET( user_agent(ua) )
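The error message from the question (length(url) == 1 is not TRUE) can be reproduced without any network access. The sketch below is only an illustration: stand_in is a hypothetical plain list used in place of the object that user_agent(ua) returns, on the assumption that paste0() treats both the same way, namely by pasting once per list element.

```r
# stand_in is a hypothetical plain list used instead of user_agent(ua);
# treating the two as equivalent under paste0() is an assumption of this sketch.
stand_in <- list(headers = "none", options = "Mozilla/5.0")
j <- 16030981

# Misplaced parenthesis: the list ends up inside paste0(), which is vectorised
# and therefore returns one string per list element.
bad_url <- paste0("https://www.deindeal.ch/de/product/", j, stand_in)
length(bad_url)   # more than one string, so it is not a single URL

# Corrected call: paste0() only sees the URL parts.
good_url <- paste0("https://www.deindeal.ch/de/product/", j)
length(good_url)  # exactly one URL, which is what GET() needs
```

With the parenthesis moved, the user agent is no longer part of the string and can be passed to GET() as its own argument.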

I used the page https://httpbin.org/get to test it.

Without user_agent() it shows "libcurl/7.68.0 r-curl/4.3.2 httr/1.4.2":

> GET( paste0('https://httpbin.org/', 'get') )

Response [https://httpbin.org/get]
  Date: 2022-03-10 17:54
  Status: 200
  Content-Type: application/json
  Size: 373 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip, br", 
    "Host": "httpbin.org", 
    "User-Agent": "libcurl/7.68.0 r-curl/4.3.2 httr/1.4.2", 
    "X-Amzn-Trace-Id": "Root=1-622a3b72-00b48f9c2c15da155db2e723"
  }, 
  "origin": "79.163.206.131", 
...

With user_agent() it shows "Mozilla/5.0":

> GET( paste0('https://httpbin.org/', 'get'), user_agent('Mozilla/5.0') )

Response [https://httpbin.org/get]
  Date: 2022-03-10 17:52
  Status: 200
  Content-Type: application/json
  Size: 346 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "deflate, gzip, br", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0", 
    "X-Amzn-Trace-Id": "Root=1-622a3ac6-41ace39e5aa032da3e312de9"
  }, 
  "origin": "79.163.206.131", 
...
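Putting the fix back into the loop from the question, a minimal sketch might look like the following. The helper names product_url(), fetch_price() and scrape_prices() are hypothetical and introduced only for illustration. The CSS selector is copied unchanged from the question, where it is described as not yet verified, so the extracted value may well be NA until the selector is corrected. It is also not guaranteed that a browser-like user agent alone is enough to avoid the 403.

```r
# Sketch only; requires the httr, rvest and stringr packages to be installed.
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# Build the product URL for one article ID (URL pattern taken from the question).
product_url <- function(id) {
  paste0("https://www.deindeal.ch/de/product/", id)
}

# Hypothetical helper: fetch one page with the user agent and extract the price.
fetch_price <- function(id, ua) {
  # user_agent() is the second argument of GET(), not part of the URL string
  resp <- httr::GET(product_url(id), httr::user_agent(ua))
  httr::stop_for_status(resp)  # turn 403 and other HTTP failures into R errors
  page <- rvest::read_html(resp)
  # Selector copied from the question; it is unverified there
  node <- rvest::html_element(page, "#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]")
  as.integer(stringr::str_extract(rvest::html_text(node), "[0-9]+"))
}

# Hypothetical helper: loop over all IDs, pausing between requests as in the question.
scrape_prices <- function(ids, ua) {
  prices <- rep(NA_integer_, length(ids))
  for (i in seq_along(ids)) {
    Sys.sleep(runif(1, min = 0.25, max = 0.5))
    # A failed request leaves NA instead of stopping the whole loop
    prices[i] <- tryCatch(fetch_price(ids[i], ua), error = function(e) NA_integer_)
  }
  prices
}

# Usage (needs network access):
# vec_deindeal <- scrape_prices(input_deindeal$`Deindeal Artikel`, ua)
```

Using seq_along() removes the need for the manually incremented counter i, and tryCatch() with an NA fallback keeps the result an integer vector, whereas try() would store an error object on failure.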

