How do I stop RCurl::getURL() if it is taking too long to execute?

Posted on 2024-11-25 08:54:36

Is there a way to tell R or the RCurl package to give up on downloading a webpage once a specified period of time has elapsed and move on to the next line of code? For example:

> library(RCurl)
> u = "http://photos.prnewswire.com/prnh/20110713/NY34814-b"
> getURL(u, followLocation = TRUE)
> print("next line") # programme does not get this far

This will just hang on my system and not proceed to the final line.

EDIT:
Based on @Richie Cotton's answer below, while I can 'sort of' achieve what I want, I don't understand why it takes longer than expected. For example, if I do the following, the system hangs until I select/unselect the 'Misc >> Buffered Output' option in RGUI:

> system.time(getURL(u, followLocation = TRUE, .opts = list(timeout = 1)))
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
  Operation timed out after 1000 milliseconds with 0 out of 0 bytes received
Timing stopped at: 0.02 0.08 ***6.76*** 

SOLUTION:
Based on @Duncan's post below and a subsequent look at the curl docs, I found the solution: use the maxredirs option, as follows:

> getURL(u, followLocation = TRUE, .opts = list(timeout = 1, maxredirs = 2, verbose = TRUE))
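With maxredirs set, getURL should still signal an R error once the redirect limit is hit, so a script would stop at that line unless the error is caught. A minimal sketch, assuming the goal is simply to carry on with the next line: wrap the call in tryCatch and return NA on any curl error. The helper name fetch_or_na and its fetch argument are hypothetical (the argument exists only so the wrapper can be tried without a network connection), and the option values are arbitrary.

```r
# Hypothetical helper: fetch a page, but return NA instead of stopping
# the script when curl reports an error (timeout, too many redirects, ...).
fetch_or_na <- function(url, ..., fetch = RCurl::getURL) {
  tryCatch(
    fetch(url, ...),
    error = function(e) {
      message("giving up on ", url, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Intended use with the options from this thread (needs RCurl and a network):
# page <- fetch_or_na(u, followLocation = TRUE,
#                     .opts = list(timeout = 1, maxredirs = 2))
# print("next line")  # reached even when the download fails
```

Returning NA rather than an empty string keeps a failed download distinguishable from a page that really is blank.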

Thank you kindly,

Tony Breyal

O/S: Windows 7
R version 2.13.0 (2011-04-13) Platform: x86_64-pc-mingw32/x64 (64-bit)
attached base packages: 
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages: 
[1] RCurl_1.6-4.1  bitops_1.0-4.1
loaded via a namespace (and not attached): 
[1] tools_2.13.0

Comments (2)

毁梦 2024-12-02 08:54:36

I believe that the Web server is getting itself into a confused state: it tells us that the URL has temporarily moved and then points us to a new URL

http://photos.prnewswire.com/medias/switch.do?prefix=/appnb&page=/getStoryRemapDetails.do&prnid=20110713%252fNY34814%252db&action=details

When we follow that, it redirects us again to .... the same URL!!!

So the timeout is not the problem. Each response comes back very quickly, so the timeout duration is never exceeded. It is the fact that we go round and round in circles that causes the apparent hang.

I found this by adding verbose = TRUE to the list of .opts; then we see all the communication between us and the server.

D.
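
The loop described above can also be made visible by following the redirects by hand. This is a rough sketch, not what was run in this thread: trace_redirects and extract_location are hypothetical helpers, and the lookup function is passed in as an argument so the logic can be tried without contacting any server. It records each URL visited and stops as soon as one repeats.

```r
# Pull the redirect target out of raw HTTP header text;
# NA when there is no Location header.
extract_location <- function(header_text) {
  hit <- regmatches(
    header_text,
    regexpr("(?i)(?<=\\nLocation: )[^\\r\\n]+", header_text, perl = TRUE)
  )
  if (length(hit) == 0) NA_character_ else hit
}

# Follow redirects one hop at a time. `next_location` takes a URL and
# returns the redirect target, or NA when the response is not a redirect.
trace_redirects <- function(url, next_location, max_hops = 10) {
  seen <- character(0)
  while (!is.na(url) && length(seen) < max_hops) {
    if (url %in% seen) {
      return(list(chain = seen, loop_at = url))  # same URL again: a loop
    }
    seen <- c(seen, url)
    url <- next_location(url)
  }
  list(chain = seen, loop_at = NA_character_)
}

# Possible lookup built on RCurl (assumes RCurl is installed and that the
# server sends absolute Location headers):
# head_location <- function(url) {
#   extract_location(RCurl::getURL(url, header = TRUE, nobody = TRUE,
#                                  followLocation = FALSE))
# }
# trace_redirects(u, head_location)
```

A non-NA loop_at in the result names the URL that the server keeps sending us back to.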

清晨说晚安 2024-12-02 08:54:36

timeout and connecttimeout are curl options, so they need to be passed in a list to the .opts parameter of getURL. I'm not sure which of the two you need, but start with

getURL(u, followLocation = TRUE, .opts = list(timeout = 3))

EDIT:

I can reproduce the hang; changing buffered output doesn't fix it for me (tested under R2.13.0 and R2.13.1), and it happens with or without the timeout argument. If you try getURL on the page that is the target of the redirect, it appears blank.

u2 <- "http://photos.prnewswire.com/medias/switch.do?prefix=/appnb&page=/getStoryRemapDetails.do&prnid=20110713%252fNY34814%252db&action=details"
getURL(u2)

If you remove the page argument, it redirects you to a login page; maybe PR Newswire is doing something funny with asking for credentials.

u3 <- "http://photos.prnewswire.com/medias/switch.do?prefix=/appnb&prnid=20110713%252fNY34814%252db&action=details"
getURL(u3)