Spark-compatible data quality framework for narrow data
I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format.
Imagine billions of rows of data that look roughly like this:
Sensor | Timestamp | Value |
---|---|---|
A | 12251 | 12 |
B | 12262 | "A" |
A | 12261 | 13 |
A | 12271 | 13 |
C | 12273 | 5.4545 |
There are hundreds of thousands of sensors, but for each timestamp only a very small percentage send values.
I'm building data quality monitoring for this data that checks some expectations about the values (e.g. whether a value falls within the expected range for a given sensor; there are tens of thousands of different expectations). Due to the size of the data and the existing infrastructure, the solution has to run on Spark. I would like to build it on an (ideally open source) data quality framework, but I cannot find anything appropriate.
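For context, a minimal sketch of what I mean by per-sensor expectations, written in plain PySpark (not using any framework). The table and column names (`sensor_readings`, `sensor_expectations`, `min_value`, `max_value`) are illustrative assumptions; the idea is that the tens of thousands of expectations fit in a small lookup table that can be broadcast-joined to the narrow data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Narrow fact table: one row per (sensor, timestamp, value).
readings = spark.table("sensor_readings")

# Small dimension table with one expected range per sensor:
# columns sensor, min_value, max_value.
expectations = spark.table("sensor_expectations")

violations = (
    readings
    # Non-numeric values (like the string "A" above) become null and are flagged.
    .withColumn("value_num", F.col("value").cast("double"))
    .join(F.broadcast(expectations), on="sensor", how="left")
    .withColumn(
        "in_range",
        F.col("value_num").between(F.col("min_value"), F.col("max_value")),
    )
    # Keep rows that are out of range, non-numeric, or have no expectation defined.
    .filter(~F.coalesce(F.col("in_range"), F.lit(False)))
)

violations.groupBy("sensor").count().show()
```

This works, but it is exactly the kind of plumbing I would prefer to get from a framework (reporting, history, alerting, etc.).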
I've looked into Great Expectations and Deequ, but these fundamentally seem to be built for "wide data", where expectations are defined per column. I could theoretically reshape (pivot) my data into that format, but it would be a very expensive operation and would result in an extremely sparse table that is awkward to work with (or it would require sampling in time, which loses information).
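To make the reshaping concrete, this is roughly the pivot I am referring to (same hypothetical `sensor_readings` table as above):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
readings = spark.table("sensor_readings")

# One column per distinct sensor. Spark first has to collect the distinct sensor
# values (capped by spark.sql.pivotMaxValues, default 10000), so with hundreds of
# thousands of sensors this is expensive and produces a mostly-null, very wide table.
wide = (
    readings
    .groupBy("timestamp")
    .pivot("sensor")
    .agg(F.first("value"))
)
```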
Does anyone know of an existing (Spark-compatible) framework for such time series data in narrow format? Or can anyone point me to best practices for applying Deequ/Great Expectations in such a setting?
Comments (1)
Have you tried
github.com/canimus/cuallee
It is an open-source framework that supports the Observation API, which makes testing on billions of records fast and less resource-hungry than PyDeequ. It is intuitive and easy to use.
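A hedged sketch of what cuallee checks on the narrow table might look like, based on my recollection of the project's README; please verify the method names (especially `is_between`) against github.com/canimus/cuallee before relying on them. The table and column names are carried over from the question and are assumptions:

```python
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel

spark = SparkSession.builder.getOrCreate()
readings = spark.table("sensor_readings")  # columns: sensor, timestamp, value

check = Check(CheckLevel.WARNING, "narrow_readings")
results = (
    check
    .is_complete("sensor")                         # no missing sensor ids
    .is_complete("value")                          # no missing values
    .is_between("timestamp", (0, 2_000_000_000))   # plausible timestamp window (assumed method)
    .validate(readings)                            # returns a Spark DataFrame of results
)
results.show(truncate=False)
```

Note that these are column-level checks; per-sensor ranges would still need to be expressed separately (for example by filtering per sensor or generating one check per sensor), so treat this only as a starting point.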