HTTP error 403 when requesting multiple pages (rvest)
I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. gets me the links that exist within a list item with a particular class. (For now, my code reflects the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:
"Error in open.connection(x, "rb") : HTTP error 403."
I considered putting a Sys.sleep() timer on each item in the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:
for (i in 1:length(data1$Search)) {
  url  <- data1$Search[i]
  name <- data1$Name[i]
  download.file(url, destfile = paste0(name, ".html"), quiet = TRUE)
}
where data1 is a two-column dataframe with the columns "Search" and "Name". Once again, any suggestions are very welcome. Thank you.
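In case it helps to see what I mean by the headers approach: below is a rough sketch of what I've been considering, using the httr package to send a browser-like User-Agent header and a short jittered pause between requests. This is an untested idea, not working code; the User-Agent string is an arbitrary example, and I'm assuming the same `Search`/`Name` columns as above.

```r
library(httr)

# Sketch: fetch each page with a browser-like User-Agent and a short
# randomized delay, writing the response body to disk as before.
for (i in 1:length(data1$Search)) {
  resp <- GET(
    data1$Search[i],
    user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),  # example value
    timeout(30)
  )
  if (status_code(resp) == 200) {
    writeLines(content(resp, as = "text", encoding = "UTF-8"),
               paste0(data1$Name[i], ".html"))
  }
  Sys.sleep(runif(1, 1, 3))  # jittered pause between requests
}
```

I realize the `Sys.sleep()` here contradicts what I said about delays being impractical, but even one to three seconds per request may be unavoidable if the block is rate-limit based rather than header based.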