在 Haskell 中管理有状态计算系统

发布于 2024-11-26 14:42:12 字数 619 浏览 1 评论 0原文

因此，我有一个链接在一起的有状态处理器系统。例如，处理器可能会输出最后 10 个输入的平均值。它需要状态来计算这个平均值。

我想向系统提交值并获取输出。我也想跳回来，恢复过去随时的状态。 IE。我通过系统运行了 1000 个值。现在我想将系统“移”回到发送第 500 个值后的状态。然后我想从那时起再次“重播”系统。

我还需要能够将历史状态保存到磁盘，以便我可以在将来的某个时间再次恢复整个事情（并且仍然可以回退和重播功能）。当然，我需要使用千兆字节的数据来完成此操作，并且速度非常快:)

我一直在使用闭包来保持状态来接近它。但我想知道使用 monad 是否更有意义。我只读过 3 或 4 个 monad 的类比，所以还不太理解它们，所以请随意教育我。

如果每个处理器以保留其历史记录的方式修改其在 monad 中的状态，并且将其与每个处理步骤的 id 绑定在一起。然后，monad 能够以某种方式将其状态切换到过去的步骤 ID，并在该状态下以 monad 运行系统。并且 monad 将具有某种机制来将自身序列化（反序列化）以进行存储。

（考虑到数据的大小......它实际上不应该全部都在内存中，这意味着 monad 需要映射到磁盘、缓存等......）

是否有现有的库/机制/已经完成或协助完成我想做的事情的方法/概念？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

心舞飞扬 2024-12-03 14:42:12

所以，我有一个链接在一起的有状态处理器系统。例如，处理器可能会输出最后 10 个输入的平均值。它需要状态来计算这个平均值。

首先，听起来您所拥有的不仅仅是“有状态处理器”，而是类似于有限状态机和/或传感器。这可能是一个很好的研究起点。

我想向系统提交值并获取输出。我也想跳回来，恢复过去随时的状态。 IE。我通过系统运行了 1000 个值。现在我想将系统“移”回到发送第 500 个值后的状态。然后我想再次从该点“重播”系统。

当然，这里最简单的方法是简单地保留所有先前状态的日志。但由于听起来您拥有大量数据，因此所需的存储很容易变得令人望而却步。我建议考虑如何以可以避免这种情况的方式构建处理器，例如：

如果可以从前面几步的邻居状态轻松重建处理器的状态，则可以避免直接记录它
如果处理器是在某些情况下很容易逆转，您不需要立即记录这些情况；可以直接计算倒带，并且可以将日志记录作为定期快照来完成。
如果您可以将处理器固定为非常少量的状态，请务必这样做。
如果处理器在某些类型的输入上以非常可预测的方式运行，您可以这样记录——例如，如果它在低于某个截止值的数字输入上空闲，而不是记录每个值，只是记录“空闲 N 个步骤”。

我还需要能够将历史状态保存到磁盘，以便我可以在将来的某个时间再次恢复整个内容（并且仍然可以回退和重播功能）。当然，我需要使用千兆字节的数据来完成此操作，并且速度非常快:)

显式状态是您的朋友。函数是表示活动状态机的便捷方法，但它们不能以任何简单的方式序列化。您希望将（基本上静态的）处理器网络与每个处理器用于计算下一步的一系列内部状态完全分离。

是否已经有一个现有的库/机制/方法/概念来完成我想要做的事情？ monad 方法有意义吗？是否有其他更好/特殊的方法可以帮助它有效地完成此操作，特别是考虑到我必须管理大量数据？

如果您的大多数处理器类似于有限状态传感器，并且您需要拥有接受各种类型的输入并产生不同类型的输出的处理器，那么您可能想要的实际上是具有基于 Arrows 的结构，它为您提供了组成事物的抽象从某种意义上说，“类似功能”，例如，将一个处理器的输入连接到另一处理器的输出。

此外，只要您避免使用 ArrowApply 类并确保您的状态机模型仅返回输出值和新状态，就可以保证避免隐式状态，因为（与函数不同）< code>箭头不会自动更高阶。

考虑到数据的大小...它实际上不应该全部都在内存中，这意味着 monad 需要映射到磁盘、缓存等...

考虑到处理器网络的静态表示，它应该提供一个增量输入/输出系统来读取数据、序列化/反序列化状态以及写入任何输出也不是太困难。

作为一个快速、粗略的起点，下面是我上面概述的可能是最简单版本的示例，暂时忽略日志记录问题：

data Transducer s a b = Transducer { curState :: s
                                   , runStep  :: s -> a -> (s, b)
                                   }

runTransducer :: Transducer s a b -> [a] -> [b]
runTransducer t [] = (t, [])
runTransducer t (x:xs) = let (s, y) = runStep t (curState t) x
                             (t', ys) = runTransducer (t { curState = s }) xs
                         in (t', y:ys)

它是一个简单的通用处理器，具有 s，输入类型为 a，输出类型为 b。 runTransducer 函数推送输入列表，手动更新状态值，并收集输出列表。

PS——既然你问的是单子，你可能想知道我给出的例子是否是单子。事实上，它是多个常见单子的组合，尽管具体是哪一个取决于你如何看待它。但是，我故意避免将其视为单子！问题是，单子仅捕获在某种意义上非常强大的抽象，但同样的能力也使它们在某些方面更能抵抗优化和静态分析。需要排除的主要问题是采用其他处理器作为输入并运行它们的处理器，这（正如您可以想象的）可以创建几乎无法分析的复杂逻辑。

因此，虽然处理器可能是单子，并且在某种逻辑意义上本质上是单子，但假装它们不是单子可能更有用；施加人为限制以使静态分析更简单。

So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.

First of all, it sounds like what you have are not just "stateful processors" but something like finite-state machines and/or transducers. This is probably a good place to start for research.

I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. Ie. I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.

The simplest approach here, of course, is to simply keep a log of all prior states. But since it sounds like you have a great deal of data, the storage needed could easily become prohibitive. I would recommend thinking about how you might construct your processors in a way that could avoid this, e.g.:

If a processor's state can be reconstructed easily from the states of its neighbors a few steps prior, you can avoid logging it directly
If a processor is easily reversible in some situations, you don't need to log those immediately; rewinding can be calculated directly, and logging can be done as periodic snapshots
If you can nail a processor down to a very small number of states, make sure to do so.
If a processor behaves in very predictable ways on certain kinds of input, you can record that as such--e.g., if it idles on numeric input below some cutoff, rather than logging each value just log "idled for N steps".

I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)

Explicit state is your friend. Functions are a convenient way to represent active state machines, but they can't be serialized in any simple way. You want a clean separation of a (basically static) network of processors vs. a series of internal states used by each processor to calculate the next step.

Is there an existing library/mechanism/approach/concept that has already been done to accomplish what I'm trying to do? Does the monad approach make sense? Are there other better/special approaches that would help it do this efficiently especially given the enormous amount of data I have to manage?

If most of your processors resemble finite state transducers, and you need to have processors that take inputs of various types and produce different types of outputs, what you probably want is actually something with a structure based on Arrows, which gives you an abstraction for things that compose "like functions" in some sense, e.g., connecting the input of one processor to the output of another.

Furthermore, as long as you avoid the ArrowApply class and make sure that your state machine model only returns an output value and a new state, you'll be guaranteed to avoid implicit state because (unlike functions) Arrows aren't automatically higher-order.

Given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...

Given a static representation of your processor network, it shouldn't be too difficult to also provide an incremental input/output system that would read the data, serialize/deserialize the state, and write any output.

As a quick, rough starting point, here's an example of probably the simplest version of what I've outlined above, ignoring the logging issue for the moment:

data Transducer s a b = Transducer { curState :: s
                                   , runStep  :: s -> a -> (s, b)
                                   }

runTransducer :: Transducer s a b -> [a] -> [b]
runTransducer t [] = (t, [])
runTransducer t (x:xs) = let (s, y) = runStep t (curState t) x
                             (t', ys) = runTransducer (t { curState = s }) xs
                         in (t', y:ys)

It's a simple, generic processor with explicit internal state of type s, input of type a, and output of type b. The runTransducer function shoves a list of inputs through, updating the state value manually, and collects a list of outputs.

P.S. -- since you were asking about monads, you might want to know if the example I gave is one. In fact, it's a combination of multiple common monads, though which ones depends on how you look at it. However, I've deliberately avoided treating it as a monad! The thing is, monads capture only abstractions that are in some sense very powerful, but that same power also makes them more resistant in some ways to optimization and static analysis. The main thing that needs to be ruled out is processors that take other processors as input and run them, which (as you can imagine) can create convoluted logic that's nearly impossible to analyze.

So, while the processors probably could be monads, and in some logical sense intrinsically are, it may be more useful to pretend that they aren't; imposing an artificial limitation in order to make static analysis simpler.

回复收藏 0 原文

~没有更多了~