R download.file() returns 403 Forbidden error

Posted 2025-02-06 01:16:58

I've been scraping this webpage for a while, but it has now started returning a 403 Forbidden error. When I visit the site manually in a browser I have no problems, but when I scrape the page I get the error.

Code is:

library(rvest)  # provides read_html()

url <- 'https://www.punters.com.au/form-guide/'
download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")

Error is:

Error in download.file(url, destfile = "webpage.html", quiet = TRUE) : 
  cannot open URL 'https://www.punters.com.au/form-guide/'
In addition: Warning message:
In download.file(url, destfile = "webpage.html", quiet = TRUE) :
  cannot open URL 'https://www.punters.com.au/form-guide/': HTTP status was '403 Forbidden'

I've looked at the documentation and tried to find an answer online, but no luck so far. Any suggestions on how I can circumvent this?

Comments (1)

长安忆 2025-02-13 01:16:58

It looks like they added User-Agent validation. You need to send a User-Agent header, and then it works.
If you don't send the User-Agent of a real browser, the site assumes you are a bot and blocks you. Here is some Python code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.punters.com.au/form-guide/"
# Pretend to be a regular browser so the request is not rejected with 403
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(baseurl, headers=headers).content
soup = BeautifulSoup(page, 'html.parser')
title = soup.find("div", class_="short_title")
print("Title: " + title.text)

The equivalent request in R with a User-Agent header:

require(httr)

# Send a browser User-Agent so the site does not block the request with 403
headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

res <- httr::GET(url = 'https://www.punters.com.au/form-guide/', httr::add_headers(.headers=headers))
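
To plug this back into the rvest workflow from the question, the httr response body can be parsed directly. Alternatively (not shown in the answer above), download.file() itself accepts a headers argument since R 3.6.0, so the original two-step approach can be kept. A minimal sketch under those assumptions, reusing the User-Agent string above and the webpage.html filename from the question:

library(httr)
library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
ua  <- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'

# Option 1: fetch with httr and parse the response body straight into rvest
res  <- httr::GET(url, httr::add_headers(`User-Agent` = ua))
html <- rvest::read_html(httr::content(res, as = "text", encoding = "UTF-8"))

# Option 2: keep download.file(), passing the User-Agent via its headers argument (R >= 3.6.0)
download.file(url, destfile = "webpage.html", quiet = TRUE,
              headers = c(`User-Agent` = ua))
html <- rvest::read_html("webpage.html")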