HTTP error 403 when requesting multiple pages (rvest)

Posted 2025-02-04 15:16:39

I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. gets me the links which exist within a list item with a particular class. (For now, my code reflects the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:

"Error in open.connection(x, "rb") : HTTP error 403."

I considered putting a Sys.sleep() timer on each item in the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:

for (i in seq_along(data1$Search)) {
    url <- data1$Search[i]
    name <- data1$Name[i]
    # Save each page's HTML to a file named after the Name column
    download.file(url, destfile = paste0(name, ".html"), quiet = TRUE)
}

where data1 is a two-column data frame with the columns "Search" and "Name". Once again, any suggestions are very welcome. Thank you.
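Since the requests succeed for a while before the 403 appears, the site is most likely blocking the default R user agent and/or rate-limiting rapid requests. A minimal sketch of the headers-plus-delay approach mentioned above, using the httr package (the data1 columns and file naming are carried over from the question; the user-agent string and the delay range are assumptions, not values the site is known to require):

    library(httr)

    for (i in seq_along(data1$Search)) {
        url <- data1$Search[i]
        name <- data1$Name[i]

        # Send a browser-like User-Agent header; the default R user agent
        # is commonly blocked by sites that return 403s.
        resp <- GET(
            url,
            user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
            write_disk(paste0(name, ".html"), overwrite = TRUE)
        )

        # Warn instead of aborting the whole loop on a failed request
        if (http_error(resp)) {
            warning("Request failed for ", url,
                    " with status ", status_code(resp))
        }

        # A short randomized pause between requests; at 1-2 seconds each,
        # ~1000 requests still finish in roughly 15-30 minutes.
        Sys.sleep(runif(1, min = 1, max = 2))
    }

A randomized delay of a second or two per request is usually a workable compromise: it is far less likely to trip rate limiting than back-to-back requests, while keeping the total runtime practical for a list of this size.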
