Spark-compatible data quality framework for narrow data
I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format.
Imagine billions of rows of data that look roughly like this:
Sensor | Timestamp | Value |
---|---|---|
A | 12251 | 12 |
B | 12262 | "A" |
A | 12261 | 13 |
A | 12271 | 13 |
C | 12273 | 5.4545 |
There are hundreds of thousands of sensors, but for each timestamp only a very small percentage send values.
I'm building data quality monitoring for this data that checks some expectations about the values (e.g. whether a value falls within the expected range for a given sensor; there are tens of thousands of different expectations). Due to the size of the data and the existing infrastructure, the solution has to run on Spark. I would like to build it on an (ideally open source) data quality framework, but I cannot find anything appropriate.
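For context, a minimal sketch of what I mean by per-sensor expectations, written in plain PySpark (not using any framework). The table and column names (`sensor_readings`, `sensor_expectations`, `min_value`, `max_value`) are illustrative assumptions; the idea is that the tens of thousands of expectations fit in a small lookup table that can be broadcast-joined to the narrow data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Narrow fact table: one row per (sensor, timestamp, value).
readings = spark.table("sensor_readings")

# Small dimension table with one expected range per sensor:
# columns sensor, min_value, max_value.
expectations = spark.table("sensor_expectations")

violations = (
    readings
    # Non-numeric values (like the string "A" above) become null and are flagged.
    .withColumn("value_num", F.col("value").cast("double"))
    .join(F.broadcast(expectations), on="sensor", how="left")
    .withColumn(
        "in_range",
        F.col("value_num").between(F.col("min_value"), F.col("max_value")),
    )
    # Keep rows that are out of range, non-numeric, or have no expectation defined.
    .filter(~F.coalesce(F.col("in_range"), F.lit(False)))
)

violations.groupBy("sensor").count().show()
```

This works, but it is exactly the kind of plumbing I would prefer to get from a framework (reporting, history, alerting, etc.).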
I've looked into Great Expectations and Deequ, but these fundamentally seem to be built for "wide data", where expectations are defined per column. I could theoretically reshape (pivot) my data into that format, but it would be a very expensive operation and would result in an extremely sparse table that is awkward to work with (or it would require sampling in time, which loses information).
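To make the reshaping concrete, this is roughly the pivot I am referring to (same hypothetical `sensor_readings` table as above):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
readings = spark.table("sensor_readings")

# One column per distinct sensor. Spark first has to collect the distinct sensor
# values (capped by spark.sql.pivotMaxValues, default 10000), so with hundreds of
# thousands of sensors this is expensive and produces a mostly-null, very wide table.
wide = (
    readings
    .groupBy("timestamp")
    .pivot("sensor")
    .agg(F.first("value"))
)
```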
Does anyone know of an existing (Spark-compatible) framework for such time series data in narrow format? Or can anyone point me to best practices for applying Deequ/Great Expectations in such a setting?
Comments (1)
Have you tried
github.com/canimus/cuallee
It is an open-source framework that supports the Observation API, which makes testing on billions of records fast and less resource-hungry than PyDeequ. It is intuitive and easy to use.
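A hedged sketch of what cuallee checks on the narrow table might look like, based on my recollection of the project's README; please verify the method names (especially `is_between`) against github.com/canimus/cuallee before relying on them. The table and column names are carried over from the question and are assumptions:

```python
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel

spark = SparkSession.builder.getOrCreate()
readings = spark.table("sensor_readings")  # columns: sensor, timestamp, value

check = Check(CheckLevel.WARNING, "narrow_readings")
results = (
    check
    .is_complete("sensor")                         # no missing sensor ids
    .is_complete("value")                          # no missing values
    .is_between("timestamp", (0, 2_000_000_000))   # plausible timestamp window (assumed method)
    .validate(readings)                            # returns a Spark DataFrame of results
)
results.show(truncate=False)
```

Note that these are column-level checks; per-sensor ranges would still need to be expressed separately (for example by filtering per sensor or generating one check per sensor), so treat this only as a starting point.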