Haskell System.Timeout.timeout crashes when called from certain functions

Posted on 2024-10-28 04:52:19


I'm scraping some data from the front pages of a list of website domains. Some of them are not answering, or are very slow, causing the scraper to halt.

I wanted to solve this by using a timeout. The various HTTP libraries available don't seem to support that, but System.Timeout.timeout seems to do what I need.
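
For reference, timeout takes a limit in microseconds plus an IO action, and returns Nothing if the action has not finished in time (its type is Int -> IO a -> IO (Maybe a)). A minimal sketch of just the API, separate from the scraper, using threadDelay to stand in for a slow request:

    import Control.Concurrent (threadDelay)
    import System.Timeout (timeout)

    -- The action sleeps for 10 seconds, so the 5-second limit fires
    -- and timeout returns Nothing.
    demo :: IO ()
    demo = do
      result <- timeout 5000000 (threadDelay 10000000 >> return "done")
      case result of
        Nothing -> putStrLn "timed out"
        Just s  -> putStrLn ("finished: " ++ s)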

Indeed, it seems to work fine when I test the scraping function, but it crashes as soon as I run the enclosing function: (Sorry for bad/ugly code. I'm learning.)

    fetchPage domain =
      -- Try to read the file from disk.
      catch
        (System.IO.Strict.readFile $ "page cache/" ++ domain)
        (\e -> downloadAndCachePage domain)


    downloadAndCachePage domain =
      catch
        (do
          -- Failed, so try to download it.

          -- This crashes when called by fetchPage, but works fine when called directly.
          maybePage <- timeout 5000000 (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)
          let page = fromMaybe "" maybePage

          -- This mostly works, but won't time out if the domain is slow. (lswb.com.cn)
          -- page <- (simpleHTTP (getRequest ("http://www." ++ domain)) >>= getResponseBody)

          -- Cache it.
          writeFile ("page cache/" ++ domain) page
          return page)
        (\e -> catch
          (do
            -- Failed, so just fuggeddaboudit.
            writeFile ("page cache/" ++ domain) ""
            return "")
          (\e -> return "")) -- Failed BIG, so just don't give a crap.

downloadAndCachePage works fine with the timeout when called from the REPL, but fetchPage crashes. If I remove the timeout from downloadAndCachePage, fetchPage works.

Can anyone explain this, or suggest an alternative solution?


Comments (1)

薄凉少年不暖心 2024-11-04 04:52:19


Your catch handler in fetchPage looks wrong -- it seems you're trying to read a file, and on a file-not-found exception you call your HTTP function directly from inside the exception handler. Don't do this. For complicated reasons, as I recall, code in exception handlers doesn't always behave like normal code -- particularly when it attempts to handle exceptions itself. And indeed, under the covers, timeout uses asynchronous exceptions to kill threads.
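
A sketch of one way to restructure along these lines (assuming the downloadAndCachePage from the question is in scope, and using try from Control.Exception): the cache read becomes an Either, and the download only starts after the handler machinery is out of the picture, in ordinary code:

    import Control.Exception (IOException, try)
    import qualified System.IO.Strict

    fetchPage :: String -> IO String
    fetchPage domain = do
      -- try gives back Left on an IO error instead of running a handler.
      cached <- try (System.IO.Strict.readFile ("page cache/" ++ domain))
                  :: IO (Either IOException String)
      case cached of
        Right page -> return page                   -- cache hit
        Left _     -> downloadAndCachePage domain   -- cache miss: download in normal code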

In general, you should put as little code as possible in exception handlers, and especially not put code that tries to handle further exceptions (although it is generally fine to reraise a handled exception to "pass it on" [as with bracket]).
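
For example, a deliberately thin handler that performs one cleanup action and rethrows, and a bracket call that does the acquire/release bookkeeping for you (the names here are only illustrative):

    import Control.Exception (SomeException, bracket, catch, throwIO)

    -- Keep the handler minimal: one cleanup action, then pass the exception on.
    withCleanup :: IO a -> IO () -> IO a
    withCleanup action cleanup =
      action `catch` \e -> do
        cleanup
        throwIO (e :: SomeException)

    -- bracket acquires, runs the body, and always runs the release step,
    -- rethrowing any exception raised in the body afterwards.
    example :: IO String
    example = bracket (return "resource") (\_ -> putStrLn "released") return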

That said, even if you're not doing the right thing, a crash (if it is a segfault type crash as opposed to a <<loop>> type crash), even from weird code, is nearly always wrong behavior from GHC, and if you're on GHC 7 then you should consider reporting this.
