Efficient handling of sparse missing data in Haskell
I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.
The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:
1. A boxed vector of unpacked Maybe values. Something like (please correct if wrong): data myMaybe a = Nothing | Just {-# UNPACK #-} !a
2. An unboxed vector of (unboxable) tuples, with a boolean element indicating missingness:
newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = MyDatum (Data.Vector.Unboxed.Vector (Bool,a))
This may be the same approach as chosen by the OP of this question (modulo Int for Bool), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors).
3. A tuple of unboxed vectors, one with the values, the other with the indices at which missing values are to be injected, or the run lengths of non-missing values, or some equivalent information. This might be preferable to option 2 if missing values are sparse?
I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.
Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!
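On option 1, a minimal sketch may help. As far as I know, the UNPACK pragma only takes effect on a strict field of a concrete single-constructor type, so a polymorphic field such as `!a` cannot be unpacked. A type name must also start with an upper-case letter. The sketch below therefore assumes a monomorphic type for Double values. The names `MaybeDouble`, `Missing`, `Present`, `obsBoxed` and `sumPresent` are hypothetical, and it assumes the `vector` package is installed.

```haskell
import qualified Data.Vector as V

-- Hypothetical monomorphic stand-in for Maybe Double.
-- The Double is stored directly inside the Present constructor.
data MaybeDouble
  = Missing
  | Present {-# UNPACK #-} !Double
  deriving (Eq, Show)

-- A boxed vector: each element is still a pointer to a heap object.
obsBoxed :: V.Vector MaybeDouble
obsBoxed = V.fromList [Present 1.5, Missing, Present 3.0]

-- Sum the values that are present, skipping missing entries.
sumPresent :: V.Vector MaybeDouble -> Double
sumPresent = V.foldl' step 0
  where
    step acc Missing     = acc
    step acc (Present x) = acc + x
```

Even with the unpacked field, the vector itself remains boxed, so this saves one indirection per element rather than giving a flat array.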
Edit:
- It's been pointed out that the answer potentially depends on what kind of operations I intend to perform on the data. At the moment, it seems more convenient to store each observation in a single vector, rather than each variable. Since the entries in the vector will therefore refer to different variables, "fold"-like ops are unlikely.
- I'm guessing 2. will internally store the "valid bit" vector à la 3. automatically if appropriate, so 3. could be dropped?
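On that last guess: as far as I understand the `vector` package, an unboxed vector of pairs is represented as a pair of unboxed vectors, so option 2 already keeps the flags and the values in separate flat arrays, with one flag per element rather than a sparse index list. A small sketch, assuming Double values and a `True` flag meaning "present"; the names `Observation`, `obsPairs`, `validBits`, `values` and `lookupValue` are hypothetical.

```haskell
import qualified Data.Vector.Unboxed as U

-- (isPresent, value) pairs; the value in a missing slot is a placeholder.
type Observation = U.Vector (Bool, Double)

obsPairs :: Observation
obsPairs = U.fromList [(True, 1.5), (False, 0), (True, 3.0)]

-- Split into the flag vector and the value vector.
-- U.unzip is documented as O(1) for unboxed vectors of pairs.
validBits :: U.Vector Bool
values    :: U.Vector Double
(validBits, values) = U.unzip obsPairs

-- Safe lookup that turns the flag back into a Maybe.
-- Out-of-range indices and missing slots both give Nothing.
lookupValue :: Observation -> Int -> Maybe Double
lookupValue v i = case v U.!? i of
  Just (True, x) -> Just x
  _              -> Nothing
```

With this layout no newtype instance is needed for a first experiment, since `Bool`, `Double` and pairs of unboxable types already have `Unbox` instances.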
Comments (1)
I'd go with option 3, but you should not use a vector to store the missing indices: that gives you O(nMissing) lookup time, which is unreasonably slow unless the missing data is extremely sparse. Data.IntMap should do the job well; you can then easily use the member function to check if an index points to a missing observation. Hash tables are even better but probably not necessary.
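A minimal sketch of this idea, under a few assumptions: it uses Data.IntSet (the set counterpart of Data.IntMap in the `containers` package) because only membership is needed, it keeps the value vector dense with a placeholder at missing positions, and it fixes the element type to Double. The names `Sparse`, `spValues`, `spMissing`, `fromMaybes` and `lookupObs` are hypothetical.

```haskell
import qualified Data.IntSet as IS
import qualified Data.Vector.Unboxed as U
import Data.Maybe (fromMaybe)

-- Hypothetical container: dense values plus the set of missing indices.
data Sparse = Sparse
  { spValues  :: !(U.Vector Double)  -- placeholder 0 stored at missing slots
  , spMissing :: !IS.IntSet          -- indices whose value is missing
  }

-- Build the structure from a list of optional values.
fromMaybes :: [Maybe Double] -> Sparse
fromMaybes xs = Sparse
  { spValues  = U.fromList (map (fromMaybe 0) xs)
  , spMissing = IS.fromList [i | (i, Nothing) <- zip [0 ..] xs]
  }

-- Check the missing set first, then index into the unboxed values.
-- Out-of-range indices also give Nothing.
lookupObs :: Sparse -> Int -> Maybe Double
lookupObs (Sparse vs missing) i
  | i `IS.member` missing = Nothing
  | otherwise             = vs U.!? i
```

For example, `lookupObs (fromMaybes [Just 1.5, Nothing, Just 3.0]) 1` evaluates to `Nothing`. Swapping `IS.IntSet` for an `IntMap` would make sense if each missing entry needed to carry extra information, such as a reason code.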