What's so bad about Lazy I/O?
I've generally heard that production code should avoid using Lazy I/O. My question is, why? Is it ever OK to use Lazy I/O outside of just toying around? And what makes the alternatives (e.g. enumerators) better?
6 Answers
Lazy IO has the problem that releasing whatever resource you have acquired is somewhat unpredictable, as it depends on how your program consumes the data -- its "demand pattern". Once your program drops the last reference to the resource, the GC will eventually run and release that resource.
Lazy streams are a very convenient style to program in. This is why shell pipes are so fun and popular.
However, if resources are constrained (as in high-performance scenarios, or production environments that expect to scale to the limits of the machine) relying on the GC to clean up can be an insufficient guarantee.
Sometimes you have to release resources eagerly, in order to improve scalability.
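As a minimal sketch of that demand-pattern problem (the file names are hypothetical, and assumed non-empty): with lazy readFile, how long each handle stays open is decided by how much of each result string is later forced, not by anything visible at the call site.

main :: IO ()
main = do
  let files = ["a.txt", "b.txt", "c.txt"]   -- hypothetical inputs
  texts <- mapM readFile files              -- lazy: no file is read here
  -- only the first character of each file is ever demanded, so every
  -- handle stays open until the GC eventually finalizes it
  putStrLn (map head texts)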
So what are the alternatives to lazy IO that don't mean giving up on incremental processing (which in turn would consume too many resources)? Well, we have foldl-based processing, aka iteratees or enumerators, introduced by Oleg Kiselyov in the late 2000s, and since popularized by a number of networking-based projects. Instead of processing data as lazy streams, or in one huge batch, we instead abstract over chunk-based strict processing, with guaranteed finalization of the resource once the last chunk is read. That's the essence of iteratee-based programming, and one that offers very nice resource constraints.
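The following is only a hand-rolled sketch of that chunk-based, strict shape (foldChunks and countBytes are made-up names, not an iteratee library's API): process one chunk at a time inside a left fold, with the handle closed deterministically once the last chunk has been read.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import System.IO

-- feed fixed-size chunks to a strict accumulator; withFile guarantees
-- the handle is closed as soon as the fold returns
foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
foldChunks step z path = withFile path ReadMode (go z)
  where
    go acc h = do
      chunk <- B.hGetSome h 32768
      if B.null chunk
        then return acc                          -- last chunk read: finalize
        else let !acc' = step acc chunk in go acc' h

-- e.g. count bytes while keeping only one chunk in memory at a time
countBytes :: FilePath -> IO Int
countBytes = foldChunks (\n c -> n + B.length c) 0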
The downside of iteratee-based IO is that it has a somewhat awkward programming model (roughly analogous to event-based programming, versus nice thread-based control). It is definitely an advanced technique, in any programming language. And for the vast majority of programming problems, lazy IO is entirely satisfactory. However, if you will be opening many files, or talking on many sockets, or otherwise using many simultaneous resources, an iteratee (or enumerator) approach might make sense.
Dons has provided a very good answer, but he's left out what is (for me) one of the most compelling features of iteratees: they make it easier to reason about space management because old data must be explicitly retained. Consider a naive mean (a sketch of the kind of consumer in question):
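-- sum and length are two separate traversals, so all of xs
-- stays live between them
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)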
This is a well-known space leak, because the entire list xs must be retained in memory to calculate both sum and length.
It's possible to make an efficient consumer by creating a fold (a single strict traversal, sketched below). But it's somewhat inconvenient to have to do this for every stream processor. There are some generalizations (Conal Elliott - Beautiful Fold Zipping), but they don't seem to have caught on. However, iteratees can get you a similar level of expression.
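A sketch of both, hand-rolled rather than using the iteratee package's actual combinators (meanFold and meanChunks are illustrative names):

{-# LANGUAGE BangPatterns #-}

-- the fold version: one strict pass, constant space
meanFold :: [Double] -> Double
meanFold = go 0 0
  where
    go !s !n []     = s / fromIntegral (n :: Int)
    go !s !n (x:xs) = go (s + x) (n + 1) xs

-- the iteratee shape: the consumers see the input chunk by chunk, so
-- a chunk may be traversed twice (once for sum, once for length) but
-- is then dead and can be garbage collected
meanChunks :: [[Double]] -> Double
meanChunks = go 0 0
  where
    go !s !n []     = s / fromIntegral (n :: Int)
    go !s !n (c:cs) = go (s + sum c) (n + length c) cs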
This isn't as efficient as a fold because the list is still iterated over multiple times; however, it's collected in chunks, so old data can be efficiently garbage collected. In order to break that property, it's necessary to explicitly retain the entire input, such as with stream2list.
The state of iteratees as a programming model is a work in progress, however it's much better than even a year ago. We're learning what combinators are useful (e.g. zip, breakE, enumWith) and which are less so, with the result that built-in iteratees and combinators provide continually more expressivity. That said, Dons is correct that they're an advanced technique; I certainly wouldn't use them for every I/O problem.
I use lazy I/O in production code all the time. It's only a problem in certain circumstances, like Don mentioned. But for just reading a few files it works fine.
Update: Recently on haskell-cafe, Oleg Kiselyov showed that unsafeInterleaveST (which is used for implementing lazy IO within the ST monad) is very unsafe: it breaks equational reasoning. He shows that it allows one to construct bad_ctx :: ((Bool,Bool) -> Bool) -> Bool such that bad_ctx (\(x,y) -> x == y) and bad_ctx (\(x,y) -> y == x) give different results, even though == is commutative (a reconstructed sketch appears at the end of this answer).
Another problem with lazy IO: the actual IO operation can be deferred until it's too late, for example after the file is closed; see Haskell Wiki - Problems with lazy IO.
This is often unexpected and an easy-to-make error.
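The classic instance of this, sketched here with a hypothetical input.txt: hGetContents defers the read, withFile closes the handle on the way out, and the contents are only forced afterwards.

import System.IO

main :: IO ()
main = do
  s <- withFile "input.txt" ReadMode hGetContents
  -- s is forced only here, after withFile has already closed the
  -- handle, so this prints an empty (truncated) string
  putStrLn s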
See also: Three examples of problems with Lazy I/O.
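A reconstruction of the bad_ctx construction mentioned above, from memory rather than Oleg's exact post, so treat it as a sketch: the unsafeInterleaveST action escapes the lazy ST sequencing, so whether body forces x or y first decides whether the write is observed by the read.

import Control.Monad.ST.Lazy (runST)
import Control.Monad.ST.Lazy.Unsafe (unsafeInterleaveST)
import Data.STRef.Lazy

bad_ctx :: ((Bool, Bool) -> Bool) -> Bool
bad_ctx body = runST (do
  r <- newSTRef False
  x <- unsafeInterleaveST (writeSTRef r True >> return True)
  y <- readSTRef r
  return (body (x, y)))

-- bad_ctx (\(x, y) -> x == y)  ~>  True   (x forced first: write, then read)
-- bad_ctx (\(x, y) -> y == x)  ~>  False  (y forced first: read sees False)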
Another problem with lazy IO that hasn't been mentioned so far is that it has surprising behaviour. In a normal Haskell program, it can sometimes be difficult to predict when each part of your program is evaluated, but fortunately due to purity it really doesn't matter unless you have performance problems. When lazy IO is introduced, the evaluation order of your code actually has an effect on its meaning, so changes that you're used to thinking of as harmless can cause you genuine problems.
As an example, here's a question about code that looks reasonable but is made more confusing by deferred IO: withFile vs. openFile
These problems aren't invariably fatal, but it's another thing to think about, and a sufficiently severe headache that I personally avoid lazy IO unless there's a real problem with doing all the work upfront.
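For instance, here is a sketch of a seemingly harmless reorder (with a hypothetical input.txt) that changes the result under lazy IO:

import Control.Exception (evaluate)
import System.IO

main :: IO ()
main = do
  h <- openFile "input.txt" ReadMode
  s <- hGetContents h
  n <- evaluate (length s)   -- forces the read while h is still open
  hClose h
  print n
  -- moving the evaluate line below hClose looks like an innocent
  -- change, but under lazy IO it makes n come out as 0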
What's so bad about lazy I/O is that you, the programmer, have to micro-manage certain resources instead of the implementation. For example, which of the following is "different"?
freeSTRef :: STRef s a -> ST s ()
closeIORef :: IORef a -> IO ()
endMVar :: MVar a -> IO ()
discardTVar :: TVar -> STM ()
hClose :: Handle -> IO ()
finalizeForeignPtr :: ForeignPtr a -> IO ()
...out of all these dismissive definitions, the last two - hClose and finalizeForeignPtr - actually do exist. As for the rest, what service they could provide in the language is much more reliably performed by the implementation! So if the dismissing of resources like file handles and foreign references was also left to the implementation, lazy I/O would probably be no worse than lazy evaluation.