Spark-compatible data quality framework for narrow data

Posted 2025-01-26 01:11:23 · 803 words · 4 views · 0 comments

I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format.

Imagine billions of rows of data that look roughly like this:

Sensor  Timestamp  Value
A       12251      12
B       12262      "A"
A       12261      13
A       12271      13
C       12273      5.4545

There are hundreds of thousands of sensors, but for each timestamp only a very small percentage send values.

I'm building data quality monitoring for this data that checks expectations about the values, e.g. whether a value falls within the expected range for a given sensor. There are tens of thousands of different expectations. Due to the size of the data and the existing infrastructure, the solution has to run on Spark. I would like to build it on an (ideally open-source) data quality framework, but cannot find anything appropriate.
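To make the requirement concrete, here is a minimal hand-rolled sketch of one such check on the narrow layout, with no pivot. It is not taken from any framework. The `RangeExpectation` type, the `find_violations` helper, the bounds, and the `min_value`/`max_value` column names are all hypothetical, and it assumes `Value` has already been cast to a double column (non-numeric readings such as "A" would need separate handling).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeExpectation:
    """Hypothetical expectation type: Value must lie in [min_value, max_value]."""
    sensor: str
    min_value: float
    max_value: float


# Illustrative bounds only; real bounds would come from a config store.
EXPECTATIONS = [
    RangeExpectation("A", 0.0, 200.0),
    RangeExpectation("C", 0.0, 10.0),
]


def expectation_rows(expectations):
    """Flatten expectations into tuples suitable for spark.createDataFrame."""
    return [(e.sensor, e.min_value, e.max_value) for e in expectations]


def find_violations(spark, readings, expectations):
    """Return the readings that fall outside their sensor's expected range.

    `readings` is assumed to be a Spark DataFrame with columns
    Sensor (string), Timestamp (long) and Value (double).
    Sensors without an expectation are dropped by the inner join.
    """
    bounds = spark.createDataFrame(
        expectation_rows(expectations),
        schema="Sensor string, min_value double, max_value double",
    )
    return (
        readings
        # The expectation table is small, so ask Spark to broadcast it
        # instead of shuffling the billions of readings.
        .join(bounds.hint("broadcast"), on="Sensor", how="inner")
        .where("Value IS NULL OR Value NOT BETWEEN min_value AND max_value")
    )
```

Given an existing `SparkSession` named `spark` and a DataFrame `readings`, `find_violations(spark, readings, EXPECTATIONS).groupBy("Sensor").count()` would give a per-sensor violation count. Keeping the expectations as rows rather than as per-column rules is what lets the check stay in the narrow format.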

I've looked into Great Expectations and Deequ, but these fundamentally seem to be built for "wide data", where the expectations are defined per column. I could theoretically reshape (pivot) my data into this format, but that would be a very expensive operation and would result in an extremely sparse table that is awkward to work with (or would require sampling over time, and with it a loss of information).

Does anyone know of an existing (Spark-compatible) framework for such time series data in narrow format? Or can anyone point me to best practices for applying Deequ/Great Expectations in such a setting?

Comments (1)

几味少女 2025-02-02 01:11:23

Have you tried github.com/canimus/cuallee?
It is an open-source framework that supports the Observation API, which makes testing on billions of records very fast and less resource-greedy than pydeequ. It is intuitive and easy to use.
