Haskell System.Timeout.timeout crashes when called from certain functions
I'm scraping some data from the frontpages of a list of website domains. Some of them are not answering, or are very slow, causing the scraper to halt.
I wanted to solve this by using a timeout. The various HTTP libraries available don't seem to support that, but System.Timeout.timeout seems to do what I need.
Indeed, it seems to work fine when I test the scraping function, but it crashes as soon as I run the enclosing function: (Sorry for bad/ugly code. I'm learning.)
fetchPage domain =
  -- Try to read the file from disk.
  catch
    (System.IO.Strict.readFile $ "page cache/" ++ domain)
    (\e -> downloadAndCachePage domain)

downloadAndCachePage domain =
  catch
    (do
      -- Failed, so try to download it.
      -- This crashes when called by fetchPage, but works fine when called directly.
      maybePage <- timeout 5000000 (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
      let page = fromMaybe "" maybePage
      -- This mostly works, but won't time out if the domain is slow. (lswb.com.cn)
      -- page <- (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
      -- Cache it.
      writeFile ("page cache/" ++ domain) page
      return page)
    (\e -> catch
      (do
        -- Failed, so just fuggeddaboudit.
        writeFile ("page cache/" ++ domain) ""
        return "")
      (\e -> return "")) -- Failed BIG, so just don't give a crap.
downloadAndCachePage works fine with the timeout when called from the REPL, but fetchPage crashes. If I remove the timeout from downloadAndCachePage, fetchPage works.
Can anyone explain this, or suggest an alternative solution?
Your catch handler in fetchPage looks wrong -- it seems you're trying to read a file and, on a file-not-found exception, are calling directly into your HTTP function from the exception handler. Don't do this. For complicated reasons, as I recall, code in exception handlers doesn't always behave like normal code -- particularly when it attempts to handle exceptions itself. And indeed, under the covers, timeout uses asynchronous exceptions to kill threads.
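To make that concrete, here is one possible restructuring (a sketch, not the answerer's code) that runs the download and the timeout in ordinary IO code rather than inside a handler, by using Control.Exception.try on the cache read and inspecting the result afterwards. The SomeException choice and the omission of the original write-failure fallback are simplifications.

import Control.Exception (SomeException, try)
import Data.Maybe (fromMaybe)
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import System.Timeout (timeout)
import qualified System.IO.Strict

fetchPage :: String -> IO String
fetchPage domain = do
  -- Attempt the cache read; only the read itself is guarded.
  cached <- try (System.IO.Strict.readFile ("page cache/" ++ domain))
  case cached :: Either SomeException String of
    Right page -> return page
    -- The fallback runs here, in ordinary IO code, not inside a handler.
    Left _     -> downloadAndCachePage domain

downloadAndCachePage :: String -> IO String
downloadAndCachePage domain = do
  -- timeout now wraps the request outside of any catch handler.
  maybePage <- timeout 5000000
                 (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
  let page = fromMaybe "" maybePage
  writeFile ("page cache/" ++ domain) page
  return page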
In general, you should put as little code as possible in exception handlers, and especially not put code that tries to handle further exceptions (although it is generally fine to reraise a handled exception to "pass it on" [as with bracket]). That said, even if you're not doing the right thing, a crash (if it is a segfault type crash as opposed to a <<loop>> type crash), even from weird code, is nearly always wrong behavior from GHC, and if you're on GHC 7 then you should consider reporting this.
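Along the same lines, the handler can be avoided entirely by testing for the cache file up front. A minimal sketch, assuming System.Directory.doesFileExist is available and ignoring the race between the check and the read:

import System.Directory (doesFileExist)
import qualified System.IO.Strict

fetchPage :: String -> IO String
fetchPage domain = do
  let path = "page cache/" ++ domain
  -- Test for the cache file instead of catching the read failure,
  -- so no exception handler is involved at all.
  exists <- doesFileExist path
  if exists
    then System.IO.Strict.readFile path
    else downloadAndCachePage domain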