如何验证数据集市中的数据是否正确?

发布于 2024-08-24 09:23:11 字数 261 浏览 1 评论 0原文

我正在开发一个数据仓库,我正在尝试找出如何最好地验证数据清理(标准化)数据库中的数据是否正确进入我们的数据集市。我已经做了一些搜索,但到目前为止的结果更多地讨论了确保诸如约束之类的事情到位,以及您需要在 ETL 过程中进行数据验证(例如日期有效等)。维度非常简单,因为我可以轻松地利用主键或编写一个非常简单且可验证的查询来获取数据。事实表更加复杂。

有什么想法吗?我们试图使主题导出变得非常容易,以运行几个查询,查看来自数据清理数据库和数据集市的一些数据,并直观地比较两者以确保它们是正确的。

I'm working on a data warehouse and I'm trying to figure out how to best verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searches, but the results so far talk more about ensuring things like constraints are in place and that you need to do data validation during the ETL process (E.g. dates are valid, etc.). The dimensions were pretty easy as I could easily either leverage the primary key or write a very simple and verifiable query to get the data. The fact tables are more complex.

Any thoughts? We're trying to make this very easy for a subject matter export to run a couple queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

无声情话 2024-08-31 09:23:11

您可以通过在其他地方实现相同数据操作的简化、精简子集并比较结果来测试事实表负载。

您至少计算两次相同的总计、计数或其他数字。一次来自事实表本身,在完成加载后,一次来自某个其他源:

  • 直接源数据,控制源和已知
  • 源系统报告 。
  • 记录

之间的所有清理步骤如果您在数据库中执行此操作,则可以将每个测试编写为查询,如果一切正确,则不返回任何 返回的任何记录都是例外:(y,z) 的 x 计数不匹配。

请参阅 ConcernedOfTunbridgeWells 发表的这篇精彩的帖子,了解更多建议。

You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.

You calculate the same totals, counts, or other figures at least twice. Once from the fact table itself, after it has finished loading, and once from some other source:

  • the source data directly, controlling for all the scrubbing steps in between source and fact
  • a source system report that is known to be correct
  • etc.

If you are doing this in the database, you could write each test as a query that returns no records if everything correct. Any records that get returned are exceptions: count of x by (y,z) does not match.

See this excellent post by ConcernedOfTunbridgeWells for more recommendations.

游魂 2024-08-31 09:23:11

尽管如果您进行大量清理或转换,它有一些缺点和潜在问题,但我发现您可以通过从星型模式重新生成输入文件来往返输入文件。然后简单地将输入文件与输出文件进行比较。可能需要一些按摩才能使它们匹配(一个是左填充,另一个是右填充)。

通常,我有一个程序,它使用与 ETL 相同的布局,并进行比较,忽略字段内的对齐。另外,文件可能需要排序 - 我使用了命令行排序。

如果您的 ETL 错误地进行了转换并且您错误地进行了转换,则此方法仍然有可能无法显示 DW 中的所有问题,并且我不会声称它具有完整的覆盖范围,但它是回归的一个非常好的第一击每个负载的单元测试。

Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round trip an input file by re-generating the input file from the star schema(s). Then simply comparing the input file to the output file. It might require some massaging to make them match (one is left padded, the other right padded).

Typically, I had a program which used the same layout the ETL used and did a compare, ignoring alignment within a field. Also, the files might have to be sorted - there is a command-line sort I used.

If your ETL does a transform incorrectly and you transform out incorrectly, it's still possible that this method doesn't show every problem in the DW, and I wouldn't claim it has complete coverage, but it's a pretty good first whack at a regression unit test for each load.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文