How to include user_agent in my scraping loop
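I'm trying to scrape prices for multiple products from the same seller, but I wasn't able to read the HTML through R (error 403). After some research I found out that you can work around this problem by setting a user agent with the httr package.
But now that I want to scrape multiple product pages in a loop, I'm not sure how to integrate the GET function and user_agent into my loop. So far my code looks like this:
for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( paste0('https://www.deindeal.ch/de/product/',j)%>%
    read_html %>%
    html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
    html_text()%>%
    str_extract("[0-9]+") %>%
    as.integer())
}
(The correct html_element and html_text are also not set yet; that will probably be a further problem.)
j refers to the article IDs of the products in the web shop, e.g. 16030981 and 16030983. So the links look like this: https://www.deindeal.ch/de/product/16030981 and https://www.deindeal.ch/de/product/16030983
Edit: So far I tried this, but without success (error message: Error in parse_url(url) : length(url) == 1 is not TRUE):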
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
for (j in input_deindeal$`Deindeal Artikel`) {
  Sys.sleep(runif(1, min=0.25, max=0.5))
  i<-i+1
  vec_deindeal[i] <- try( GET( paste0('https://www.deindeal.ch/de/product/',j,user_agent(ua)))%>%
    read_html %>%
    html_element('#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]') %>%
    html_text()%>%
    str_extract("[0-9]+") %>%
    as.integer())
}
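Comments (1)
Your ) is in the wrong place. You have user_agent(ua) inside paste0(), but it should be outside paste0(), as the second argument of GET().
Written with extra spaces to show it:
GET( paste0('https://www.deindeal.ch/de/product/', j), user_agent(ua) )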
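The parse_url error from the question can be reproduced without sending any request. This is a minimal sketch; it assumes the httr package is installed, and the single article ID is just one of the examples from the question. As far as I can tell, user_agent() returns a request-configuration object (a list) rather than a string, which is why pasting it into the URL produces more than one string.

```r
library(httr)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
j  <- 16030981  # example article ID taken from the question

# Misplaced parenthesis: user_agent(ua) ends up inside paste0().
# paste0() converts the configuration list element by element, so the
# result is a character vector with several entries, not a single URL.
bad_url <- paste0('https://www.deindeal.ch/de/product/', j, user_agent(ua))
print(length(bad_url) > 1)

# Closing paste0() right after j yields exactly one URL string.
good_url <- paste0('https://www.deindeal.ch/de/product/', j)
print(length(good_url) == 1)

# The user agent is then passed to GET() as its own argument
# (left commented out so that this sketch sends no request):
# resp <- GET(good_url, user_agent(ua))
```

GET() insists on a URL of length one, so only the second form gets past parse_url().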
I tested this with the page https://httpbin.org/get.
Without user_agent() it shows "libcurl/7.68.0 r-curl/4.3.2 httr/1.4.2".
With user_agent() it shows "Mozilla/5.0".
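Putting the fix back into the loop from the question could look roughly like the sketch below. The helper names product_url, fetch_price, scrape_prices and echo_user_agent are hypothetical and not part of any package. The CSS selector is copied unchanged from the question, which itself says it may not be correct yet. The layout of the httpbin response (a headers field containing "User-Agent") is an assumption about that service. Nothing is requested until one of the functions is called.

```r
library(httr)
library(rvest)
library(stringr)

ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"

# Hypothetical helper: build the product URL for one article ID.
product_url <- function(id) {
  paste0("https://www.deindeal.ch/de/product/", id)
}

# Hypothetical helper: download one product page with a custom user agent
# and return the first run of digits found in the price element.
fetch_price <- function(id, agent = ua) {
  resp <- GET(product_url(id), user_agent(agent))  # user agent is a separate argument
  stop_for_status(resp)                            # turn HTTP errors such as 403 into R errors
  content(resp, as = "text", encoding = "UTF-8") %>%
    read_html() %>%
    # Selector taken from the question; unverified against the real page.
    html_element("#QuantitySelectorLayout_QuantitySelectorLayout [id$=price]") %>%
    html_text() %>%
    str_extract("[0-9]+") %>%
    as.integer()
}

# Hypothetical helper: loop over all IDs, pausing briefly between requests.
# A failed request leaves NA for that ID instead of stopping the loop.
scrape_prices <- function(ids, agent = ua) {
  prices <- rep(NA_integer_, length(ids))
  for (i in seq_along(ids)) {
    Sys.sleep(runif(1, min = 0.25, max = 0.5))
    prices[i] <- tryCatch(fetch_price(ids[i], agent),
                          error = function(e) NA_integer_)
  }
  prices
}

# Hypothetical helper: ask httpbin which user agent it received.
# Assumes the parsed JSON reply has a headers entry named "User-Agent".
echo_user_agent <- function(agent = ua) {
  resp <- GET("https://httpbin.org/get", user_agent(agent))
  content(resp)$headers[["User-Agent"]]
}
```

With the data from the question, a call such as vec_deindeal <- scrape_prices(input_deindeal$`Deindeal Artikel`) would replace the manual i <- i + 1 counter. Using seq_along() keeps the result vector aligned with the IDs even when a request fails.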