R download.file() returns 403 Forbidden error

Posted 2025-02-06 01:16:58

I've been scraping this webpage for a while, but it has now started returning a 403 Forbidden error. When I visit the site manually in a browser I have no problems, but when I scrape the page I get the error.

Code is:

library(rvest)  # provides read_html()

url <- 'https://www.punters.com.au/form-guide/'
download.file(url, destfile = "webpage.html", quiet = TRUE)
html <- read_html("webpage.html")

Error is:

Error in download.file(url, destfile = "webpage.html", quiet = TRUE) : 
  cannot open URL 'https://www.punters.com.au/form-guide/'
In addition: Warning message:
In download.file(url, destfile = "webpage.html", quiet = TRUE) :
  cannot open URL 'https://www.punters.com.au/form-guide/': HTTP status was '403 Forbidden'

I've looked at the documentation and tried to find an answer online, but no luck so far. Any suggestions on how I can circumvent this?

Comments (1)

长安忆 2025-02-13 01:16:58

It looks like they added User-Agent validation. You need to send a User-Agent header, and then it works.
If you don't send the User-Agent of a real browser, the site assumes you are a bot and blocks you. Here is some Python code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.punters.com.au/form-guide/"
# Pretend to be a regular browser so the request is not rejected with 403
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(baseurl, headers=headers).content
soup = BeautifulSoup(page, 'html.parser')
title = soup.find("div", class_="short_title")
print("Title: " + title.text)

The equivalent request in R with a User-Agent header:

require(httr)

# Send a browser User-Agent so the site does not block the request with 403
headers = c(
  `user-agent` = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'
)

res <- httr::GET(url = 'https://www.punters.com.au/form-guide/', httr::add_headers(.headers=headers))
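
To plug this back into the rvest workflow from the question, the httr response body can be parsed directly. Alternatively (not shown in the answer above), download.file() itself accepts a headers argument since R 3.6.0, so the original two-step approach can be kept. A minimal sketch under those assumptions, reusing the User-Agent string above and the webpage.html filename from the question:

library(httr)
library(rvest)

url <- 'https://www.punters.com.au/form-guide/'
ua  <- 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.61 Safari/537.36'

# Option 1: fetch with httr and parse the response body straight into rvest
res  <- httr::GET(url, httr::add_headers(`User-Agent` = ua))
html <- rvest::read_html(httr::content(res, as = "text", encoding = "UTF-8"))

# Option 2: keep download.file(), passing the User-Agent via its headers argument (R >= 3.6.0)
download.file(url, destfile = "webpage.html", quiet = TRUE,
              headers = c(`User-Agent` = ua))
html <- rvest::read_html("webpage.html")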