How do I stop execution if RCurl::getURL() takes too long?
Is there a way to tell R or the RCurl package to give up on trying to download a webpage if it exceeds a specified period of time, and move on to the next line of code? For example:
> library(RCurl)
> u = "http://photos.prnewswire.com/prnh/20110713/NY34814-b"
> getURL(u, followLocation = TRUE)
> print("next line") # programme does not get this far
This will just hang on my system and not proceed to the final line.
EDIT:
Based on @Richie Cotton's answer below, I can 'sort of' achieve what I want, but I don't understand why it takes longer than expected. For example, if I do the following, the system hangs until I select/unselect the 'Misc >> Buffered Output' option in RGUI:
> system.time(getURL(u, followLocation = TRUE, .opts = list(timeout = 1)))
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
Operation timed out after 1000 milliseconds with 0 out of 0 bytes received
Timing stopped at: 0.02 0.08 ***6.76***
SOLUTION:
Based on @Duncan's post below, and after a subsequent look at the curl docs, I found a solution using the maxredirs option:
> getURL(u, followLocation = TRUE, .opts = list(timeout = 1, maxredirs = 2, verbose = TRUE))
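Putting the pieces together, here is a sketch of how the call can be made to fail fast and still fall through to the next line. The option values are illustrative, and the tryCatch wrapper is my addition rather than part of the original solution:

library(RCurl)

u <- "http://photos.prnewswire.com/prnh/20110713/NY34814-b"

# Cap the whole transfer at 1 second and stop after 2 redirects, so a
# redirect loop raises an error quickly instead of hanging.
html <- tryCatch(
    getURL(u, followLocation = TRUE,
           .opts = list(timeout = 1, maxredirs = 2)),
    error = function(e) {
        message("download failed: ", conditionMessage(e))
        NA_character_  # placeholder so the script can carry on
    }
)
print("next line")  # now reached even when the download fails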
Thank you kindly,
Tony Breyal
O/S: Windows 7
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RCurl_1.6-4.1  bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.13.0
Comments (2)
I believe that the Web server is getting itself into a confused state: it tells us that the URL has temporarily moved, and then points us to a new URL

http://photos.prnewswire.com/medias/switch.do?prefix=/appnb&page=/getStoryRemapDetails.do&prnid=20110713%252fNY34814%252db&action=details

When we follow that, it redirects us again to .... the same URL!

So the timeout is not the problem. The response comes back very quickly, so the timeout duration is never exceeded. It is the fact that we go round and round in circles that causes the apparent hang.

The way I found this was by adding verbose = TRUE to the list of .opts; then we see all the communication between us and the server.

D.
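For reference, a minimal sketch of that diagnostic step. debugGatherer() is RCurl's helper for capturing the verbose traffic; using it here (and the maxredirs cap) is my suggestion rather than code from the original answer:

library(RCurl)

u <- "http://photos.prnewswire.com/prnh/20110713/NY34814-b"

# Collect curl's verbose output instead of letting it scroll past on stderr.
d <- debugGatherer()
tryCatch(
    getURL(u, followLocation = TRUE, verbose = TRUE,
           debugfunction = d$update,
           .opts = list(maxredirs = 2)),  # stop the loop after two hops
    error = function(e) invisible(NULL)   # only the trace matters here
)
cat(d$value()["headerIn"])  # incoming headers; the Location: lines expose the loop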
timeout and connecttimeout are curl options, so they need to be passed in a list to the .opts parameter to getURL. Not sure which of the two you need, but start with the timeout option.

EDIT:

I can reproduce the hang; changing buffered output doesn't fix it for me (tested under R 2.13.0 and R 2.13.1), and it happens with or without the timeout argument. If you try getURL on the page that is the target of the redirect, it appears blank. If you remove the page argument, it redirects you to a login page; maybe PR Newswire is doing something funny with asking for credentials.
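To illustrate the difference between the two options (a sketch with illustrative values, not code from the original answer): connecttimeout caps only the connection phase, while timeout caps the entire request:

library(RCurl)

u <- "http://photos.prnewswire.com/prnh/20110713/NY34814-b"

# connecttimeout: give up if the connection is not established within
#                 5 seconds (CURLOPT_CONNECTTIMEOUT).
# timeout:        give up if the whole transfer takes more than
#                 10 seconds, connection included (CURLOPT_TIMEOUT).
html <- tryCatch(
    getURL(u, followLocation = TRUE,
           .opts = list(connecttimeout = 5, timeout = 10)),
    error = function(e) NA_character_  # fall through on failure
)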