Efficient handling of sparse missing data in Haskell

Posted on 2024-12-15 12:03:11

I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.

The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:

1. A boxed vector of unpacked Maybe values. Something like (please correct if wrong):

    data MyMaybe a = Nothing | Just {-# UNPACK #-} !a

2. An unboxed vector of (unboxable) tuples, with a boolean element indicating missingness:

    newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = MyDatum (Data.Vector.Unboxed.Vector (Bool,a))

   This may be the same approach as chosen by the OP of this question (modulo Int for Bool), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors).

3. A tuple of unboxed vectors, one with the values, the other with the indices at which missing values are to be injected, or the run lengths of non-missing values, or some equivalent information. This might be preferable to option 2 if missing values are sparse?
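To make option 2 concrete, here is a minimal sketch, assuming the payload is a Double and using the Unbox instance for pairs that the vector package already provides (so no hand-written newtype instance is needed). The names Column, fromMaybes, lookupObs and meanPresent are hypothetical helpers made up for illustration; they are not part of any library.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- Hypothetical alias: each slot pairs a "present" flag with a payload.
type Column = U.Vector (Bool, Double)

-- Build a column from Maybe values. Missing slots get a dummy payload of 0
-- that is never read, because their flag is False.
fromMaybes :: [Maybe Double] -> Column
fromMaybes = U.fromList . map toSlot
  where
    toSlot (Just x) = (True, x)
    toSlot Nothing  = (False, 0)

-- Safe lookup: Nothing for out-of-range indices and for missing slots.
lookupObs :: Column -> Int -> Maybe Double
lookupObs col i = case col U.!? i of
  Just (True, x) -> Just x
  _              -> Nothing

-- Mean over the present values only; Nothing if no value is present.
meanPresent :: Column -> Maybe Double
meanPresent col
  | n == 0    = Nothing
  | otherwise = Just (s / fromIntegral n)
  where
    (s, n) = U.foldl' step (0, 0 :: Int) col
    -- Bang patterns keep both accumulator fields evaluated on each step.
    step (!acc, !k) (ok, x)
      | ok        = (acc + x, k + 1)
      | otherwise = (acc, k)
```

For example, lookupObs (fromMaybes [Just 1, Nothing, Just 3]) 1 gives Nothing, while index 0 gives Just 1.0. On option 1: as far as I know, GHC cannot unpack a strict field of polymorphic type a, so the UNPACK pragma there would only take effect for a concrete field type such as Double.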

I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.

Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!

Edit:

It's been pointed out that the answer potentially depends on what kind of operations I intend to perform on the data. At the moment, it seems more convenient to store each observation in a single vector, rather than each variable. Since the entries in the vector will therefore refer to different variables, "fold"-like ops are unlikely.
I'm guessing option 2 will internally store the "valid bit" vector à la option 3 automatically if appropriate, so option 3 could be dropped?
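On that guess, a small sketch may help. To my knowledge, Data.Vector.Unboxed represents a vector of pairs as a pair of vectors, which is why U.unzip is documented as O(1). The flags therefore already live in their own unboxed vector, although each Bool takes a whole byte rather than a single bit, and there is one flag per slot no matter how few values are missing. The helper names below are hypothetical.

```haskell
import qualified Data.Vector.Unboxed as U

-- The flag component of a (flag, value) vector; no elements are copied,
-- since unboxed pairs are stored as two separate vectors.
validBits :: U.Vector (Bool, Double) -> U.Vector Bool
validBits = fst . U.unzip

-- The payload component, including the dummy values at missing slots.
payloads :: U.Vector (Bool, Double) -> U.Vector Double
payloads = snd . U.unzip

-- Number of slots whose flag is False.
countMissing :: U.Vector (Bool, Double) -> Int
countMissing = U.length . U.filter not . validBits
```

So option 2 gives a dense flag vector, not the sparse index list of option 3; the two only coincide in spirit, and option 3 can still use less memory when missing values are rare.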

神仙妹妹 2024-12-22 12:03:11

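I would go with option 3, but you should not use a vector to store the missing indices: that gives you O(nMissing) lookup time, which is only acceptable if the missing data are extremely sparse. Data.IntMap should do the job well, and you can then easily use the member function to check whether an index points to a missing observation. Hash tables would be even better, but are probably not necessary.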
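A minimal sketch of this suggestion follows, with two assumptions of mine that the answer does not make: the values are kept dense, with a placeholder 0 at missing indices, and the missing indices go into a Data.IntSet instead of a Data.IntMap, since only membership is needed here (Data.IntMap has a member function that works the same way on keys). Sparse, fromMaybes and lookupObs are hypothetical names.

```haskell
import qualified Data.IntSet as IS
import qualified Data.Vector.Unboxed as U
import Data.Maybe (fromMaybe)

-- Hypothetical record: dense payloads plus the set of missing indices.
data Sparse = Sparse
  { values  :: !(U.Vector Double)  -- placeholder 0 at missing indices
  , missing :: !IS.IntSet          -- indices of missing observations
  }

-- Build the structure from Maybe values, recording where Nothing occurs.
fromMaybes :: [Maybe Double] -> Sparse
fromMaybes xs = Sparse
  { values  = U.fromList (map (fromMaybe 0) xs)
  , missing = IS.fromList [i | (i, Nothing) <- zip [0 ..] xs]
  }

-- Check the missing set first, then index the dense vector safely.
lookupObs :: Sparse -> Int -> Maybe Double
lookupObs s i
  | i `IS.member` missing s = Nothing
  | otherwise               = values s U.!? i
```

Compared with a flag per slot, the extra memory here grows with the number of missing values only, at the cost of a set lookup on every access.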