RCurl error: connection time out
I use the XML and RCurl packages of R to get data from a website.
The script needs to scrape 6,000,000 pages, so I created a loop.
library(RCurl)
library(XML)

for (page in 1:6000000) {
  my_url <- paste('http://webpage.....')
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, '//head/title', xmlValue, simplify = TRUE, encoding = "UTF-8")
  .....
  .....
  .....
}
However, after a few loops I get the error message:
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : connection time out
The problem is that I don't understand how the timeout works. Sometimes the process stops after 700 pages, other times after 1000, 1200, etc.; the breaking point is not stable.
Once the connection has timed out, I can't access the website from my laptop for 15 minutes.
I thought of using a command to delay the process for 15 minutes after every 1000 pages scraped,
if(page==1000) Sys.sleep(901)
but nothing changed.
Any ideas what is going wrong and how to overcome this?
2 Answers
You could make a call from R to a native installation of curl using the command system(). This way you get access to all the curl options not currently supported by RCurl, such as --retry <num>. The option --retry <num> will cause an issued curl query to retry repeatedly, waiting ever longer after each failure: 1 second after the first failure, 2 seconds after the second, 4 seconds after the third, and so on. Other time-control options are documented on the cURL site: http://curl.haxx.se/docs/manpage.html
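For concreteness, here is a minimal sketch of what such a call could look like from R. It is an illustration under assumptions, not part of the original answer: the URL, retry count, and time limit below are placeholders.

library(XML)

my_url  <- "http://webpage.example/1"    # placeholder URL, not the real site
outfile <- tempfile(fileext = ".html")   # temporary file for the response body

# --retry 5    : retry up to 5 times, backing off 1s, 2s, 4s, ... between attempts
# --max-time 30: abort any single attempt after 30 seconds
# -s           : silent; -o writes the response body to outfile
status <- system(paste("curl --retry 5 --max-time 30 -s -o",
                       shQuote(outfile), shQuote(my_url)))

if (status == 0) {
  mydata <- htmlParse(outfile, encoding = "UTF-8")
  title  <- xpathSApply(mydata, "//head/title", xmlValue)
}

A non-zero status after all retries means the request ultimately failed, so the page number can be recorded and re-queued for a later pass.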
I solved it. Just added Sys.sleep(1) to each iteration.
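In the context of the question's loop, that means pausing briefly on every iteration rather than sleeping once at page 1000. A minimal sketch, with a placeholder URL scheme standing in for the real one:

library(RCurl)
library(XML)

for (page in 1:6000000) {
  my_url <- paste0("http://webpage.example/", page)  # placeholder URL scheme
  page1  <- getURL(my_url, encoding = "UTF-8")
  mydata <- htmlParse(page1, asText = TRUE, encoding = "UTF-8")
  title  <- xpathSApply(mydata, "//head/title", xmlValue)
  Sys.sleep(1)  # wait one second between requests so the server is not flooded
}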