HTTP error 403 when requesting multiple pages (rvest)
I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. gets me the links that exist within a list item with a particular class. (For now, my code reflects the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:
"Error in open.connection(x, "rb") : HTTP error 403."
I considered putting a Sys.sleep() timer on each item in the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:
for (i in 1:length(data1$Search)) {
  url  <- data1$Search[i]
  name <- data1$Name[i]
  download.file(url, destfile = paste0(name, ".html"), quiet = TRUE)
}
where data1 is a two-column dataframe with the columns "Search" and "Name". Once again, any suggestions are very welcome. Thank you.
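In case it helps to see what I mean by the headers approach: below is a rough sketch of what I've been considering, using the httr package to send a browser-like User-Agent header and a short jittered pause between requests. This is an untested idea, not working code; the User-Agent string is an arbitrary example, and I'm assuming the same `Search`/`Name` columns as above.

```r
library(httr)

# Sketch: fetch each page with a browser-like User-Agent and a short
# randomized delay, writing the response body to disk as before.
for (i in 1:length(data1$Search)) {
  resp <- GET(
    data1$Search[i],
    user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),  # example value
    timeout(30)
  )
  if (status_code(resp) == 200) {
    writeLines(content(resp, as = "text", encoding = "UTF-8"),
               paste0(data1$Name[i], ".html"))
  }
  Sys.sleep(runif(1, 1, 3))  # jittered pause between requests
}
```

I realize the `Sys.sleep()` here contradicts what I said about delays being impractical, but even one to three seconds per request may be unavoidable if the block is rate-limit based rather than header based.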