RCurl error: connection time out
I use the XML and RCurl packages of R to get data from a website.
The script needs to scrape 6,000,000 pages, so I created a loop.
library(RCurl)
library(XML)

for (page in 1:6000000) {
  my_url <- paste('http://webpage.....')
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue, simplify = TRUE, encoding = "UTF-8")
  .....
  .....
  .....
}
However, after a few loops I get the error message:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : connection time out
The problem is that I don't understand how the timeout works. Sometimes the process stops after 700 pages, other times after 1000, 1200, etc.; the breaking point is not stable.
Once the connection has timed out, I can't access the website from my laptop for 15 minutes.
I thought of using a command to delay the process for 15 minutes after every 1000 pages scraped,
if(page==1000) Sys.sleep(901)
but nothing changed.
Any ideas what is going wrong and how to overcome this?
2 Answers
You could make a call from R to a native installation of curl using the command system(). This way you get access to all the curl options not currently supported by RCurl, such as --retry <num>. The option --retry <num> will cause an issued curl query to retry repeatedly, waiting ever longer after each failure: 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. Other time-control options are documented on the cURL site: http://curl.haxx.se/docs/manpage.html
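For concreteness, here is a minimal sketch of what such a call could look like from R. It is an illustration under assumptions, not part of the original answer: the URL, retry count, and time limit below are placeholders.

library(XML)

my_url  <- "http://webpage.example/1"    # placeholder URL, not the real site
outfile <- tempfile(fileext = ".html")   # temporary file for the response body

# --retry 5    : retry up to 5 times, backing off 1s, 2s, 4s, ... between attempts
# --max-time 30: abort any single attempt after 30 seconds
# -s           : silent; -o writes the response body to outfile
status <- system(paste("curl --retry 5 --max-time 30 -s -o",
                       shQuote(outfile), shQuote(my_url)))

if (status == 0) {
  mydata <- htmlParse(outfile, encoding = "UTF-8")
  title  <- xpathSApply(mydata, "//head/title", xmlValue)
}

A non-zero status after all retries means the request ultimately failed, so the page number can be recorded and re-queued for a later pass.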
I solved it. Just added Sys.sleep(1) to each iteration.
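In the context of the question's loop, that means pausing briefly on every iteration rather than sleeping once at page 1000. A minimal sketch, with a placeholder URL scheme standing in for the real one:

library(RCurl)
library(XML)

for (page in 1:6000000) {
  my_url <- paste0("http://webpage.example/", page)  # placeholder URL scheme
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, "//head/title", xmlValue)
  Sys.sleep(1)  # wait one second between requests so the server is not flooded
}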