How can I prevent dbt from rebuilding incremental tables when a column in the source view changes?


We have the following structure in dbt:

[events (view), with column1, column2, column3]--->[incremental1, using column1 from events]
                                             |
                                             |
                                             └---->[incremental2, using column2 from events]

In our continuous integration setup, we run dbt run --models state:modified+ to avoid running models that were not changed in a PR. However, the problem is that if we change column3 in events (e.g. rename it), both incremental tables, incremental1 and incremental2, will be fully rebuilt, which may take an excessively long time, sometimes more than a whole day. Note that neither incremental table is actually affected by the change in its source, as neither uses the changed column in any way.

Is there a way to avoid rebuilding these models while also not having to run all the other models (which is what a simple dbt run would do)?

Some additional details:

  • our dbt version is 1.0
  • our data warehouse engine is Snowflake
  • our incremental strategy is 'delete+insert'
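
For reference, the state-comparison setup described above boils down to something like the following, assuming the production manifest has been downloaded into a prod-run-artifacts/ directory (the path is illustrative):

# build only models modified in this PR, plus everything downstream of them,
# compared against the manifest from the last production run
dbt run --models state:modified+ --state prod-run-artifacts/

# optionally, --defer lets unchanged upstream models resolve to their
# production relations instead of being rebuilt in the CI schema
dbt run --models state:modified+ --defer --state prod-run-artifacts/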


1 Answer

蛮可爱 · 2025-02-03 15:38:21


dbt run takes an --exclude option; you could exclude those long-running tables by name, or tag them. This obviously carries its own risks and maintenance burden.
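
A sketch of both variants; the tag name here is made up:

# exclude the long-running incrementals by name...
dbt run --models state:modified+ --exclude incremental1 incremental2

# ...or tag the heavy models once, e.g. {{ config(tags=['slow']) }},
# and exclude the tag instead of maintaining a list of names
dbt run --models state:modified+ --exclude tag:slow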

Another option would be to filter the records in these large tables on a dev or CI run so that they build much more quickly. You could add something like this to the incremental model:

{% if target.name != 'prod' %}
-- dev/CI targets only: limit the build to the last 30 days of data
where updated_at > current_timestamp() - interval '30 days'
{% endif %}
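
Put together, a minimal sketch of how that filter might sit inside one of the incremental models; the unique_key and column names are illustrative, not from the question:

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_id'
) }}

select
    column1,
    updated_at
from {{ ref('events') }}
where 1 = 1

{% if is_incremental() %}
-- normal incremental run: only pull rows newer than what is already loaded
and updated_at > (select max(updated_at) from {{ this }})
{% endif %}

{% if target.name != 'prod' %}
-- dev/CI targets: cap even a full rebuild at 30 days of history
and updated_at > current_timestamp() - interval '30 days'
{% endif %}

The where 1 = 1 line is just a convenience so both optional predicates can be appended with and; production runs remain unfiltered.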