Automatically creating SQL files in dbt

Posted on 2025-01-10 18:37:59


Hey, I am a beginner in dbt and I am trying to create the staging layer. I have created the table schema and now want to automate the process of creating models, as I have to create over 100 SQL models.

My schema is like (xyz.yml):

- schema: xyz
  tables:
    - name: abc
      loaded_at_field: updated_at
      freshness:
        warn_after: {count: 24, period: hour}
        filter: updated_at > current_date - 7
    - name: def
      loaded_at_field: updated_at
      freshness:
        warn_after: {count: 24, period: hour}
        filter: updated_at > current_date - 7
    - name: ghi
    - name: jkl
    - name: mno

SQL models should be generated for each of the table names. I have more than 100 tables and would like to create the staging model SQL files automatically (like xyz_abc.sql) for all the table names in dbt.


3 Comments

还不是爱你 2025-01-17 18:37:59


You could write a little Python script for this (or use any other tool, for that matter).

I mean, that does come with the big dependency of knowing how to write Python...

The code should be something like:

import yaml

# the top level of the schema file is a list of schema entries
# (it starts with "- schema: xyz"), so take the first one
with open('{YOUR PATH TO THE SCHEMA FILE}/schema.yaml', 'r') as file:
    schema = yaml.safe_load(file)[0]

names = [table['name'] for table in schema['tables']]

for name in names:
    with open(f'{name}.sql', 'w') as file:
        file.write(f'SELECT * FROM {name}')

NOTE: I did not test this code
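Building on that idea, here is a fuller sketch (also untested against a real dbt project; the `models/staging` output directory and the `<schema>_<table>.sql` naming are assumptions taken from the question) that handles the list-shaped YAML above, writes one file per table, and references each table through dbt's `source()` macro so dbt can track lineage and freshness:

```python
import os
import yaml

def generate_staging_models(schema_file, output_dir):
    """Write one <schema>_<table>.sql staging model per source table."""
    with open(schema_file) as f:
        sources = yaml.safe_load(f)  # top level is a list of schema entries

    os.makedirs(output_dir, exist_ok=True)
    written = []
    for source in sources:
        schema_name = source['schema']
        for table in source.get('tables', []):
            table_name = table['name']
            model_path = os.path.join(output_dir, f'{schema_name}_{table_name}.sql')
            with open(model_path, 'w') as f:
                # reference the source rather than the raw table name
                f.write(f"select * from {{{{ source('{schema_name}', '{table_name}') }}}}\n")
            written.append(model_path)
    return written
```

Calling `generate_staging_models('xyz.yml', 'models/staging')` from the project root would produce `xyz_abc.sql`, `xyz_def.sql`, and so on, which dbt picks up on the next run.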

云归处 2025-01-17 18:37:59


You could write a Python script as @MYK mentioned above. But there is also a dbt package for creating the schema.yml and model files.

Take a look at this package:
https://github.com/dbt-labs/dbt-codegen

You might still want to use some Python to clean up the output and provide table names dynamically to codegen.
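To provide the table names dynamically, a small script can build one `dbt run-operation generate_base_model` invocation per table in the sources YAML. A sketch (the `--args` payload follows dbt-codegen's documented interface; actually running the commands requires dbt and the package to be installed):

```python
import json
import yaml

def codegen_commands(schema_file):
    """Build one dbt-codegen generate_base_model command per source table."""
    with open(schema_file) as f:
        sources = yaml.safe_load(f)  # top level is a list of schema entries

    commands = []
    for source in sources:
        for table in source.get('tables', []):
            args = {'source_name': source['schema'], 'table_name': table['name']}
            commands.append(['dbt', 'run-operation', 'generate_base_model',
                             '--args', json.dumps(args)])
    return commands

# Each command can then be executed with subprocess.run(cmd, check=True,
# capture_output=True) and its stdout cleaned up and written to
# models/staging/<schema>_<table>.sql.
```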

眼角的笑意。 2025-01-17 18:37:59


In general I am quite critical of the idea of generating 100 models.
The reason is simple: unless you just need to read the data of these 100 tables and expose them "as is", you will need to apply some business logic to them.

If you are in the first case... why do you need dbt at all?

If you are going to apply some business logic... writing the code is the least time-consuming operation: if you are trying to materialize the data and save changes, you need to know the primary key; if you want to combine data from multiple systems, you need to know the business keys, have mapping tables, and have some idea of how to apply master data management... writing code that you could generate is the least of the problems.

If you have a project with 100 tables, that is no trivial amount of work, and, assuming you need to use all 100 tables, you will need to understand them and write business rules for them.
In this context automatic model generation would save only a tiny fraction of the time spent on each table... so why bother?

IMO it is much better to have something that saves you the grunt work, while you write each model yourself so you are sure to apply the right pattern.

Also, I do prefer adding tables only when needed, using something like the dbt-codegen package or, if you have a repeatable pattern you want to apply, a self-written SQL query that uses the COLUMNS view from INFORMATION_SCHEMA to provide the table-specific values you accommodate in the template that applies the pattern.

A query like the following already goes a long way toward giving you the beef of the table, so that you can change the names you do not like and apply eventual casts or other hard business rules with minimal effort:

SELECT ', ' || COLUMN_NAME || ' as '|| COLUMN_NAME || ' -- ' || DATA_TYPE as SQL_TEXT
FROM <db>.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'XXX' and TABLE_NAME = 'YYY'
ORDER BY ORDINAL_POSITION;

And then you add one model at a time, when you actually need it (the YAGNI principle), without starting by "loading all tables" from some data source.

PS: You do not need to repeat the same freshness SLA definition 100 times.
You can declare it once at the source system level and override only the parameters that differ for a specific table.
Start by saving complexity where it is easy ;)
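For illustration, the sources file from the question could be restated in that style, with `loaded_at_field` and `freshness` declared once at the source level and inherited by every table (this mirrors the question's own layout; dbt also expects the `version: 2` / `sources:` wrapper shown here):

```yaml
version: 2

sources:
  - name: xyz
    schema: xyz
    loaded_at_field: updated_at      # inherited by every table below
    freshness:
      warn_after: {count: 24, period: hour}
      filter: updated_at > current_date - 7
    tables:
      - name: abc
      - name: def
      - name: ghi
      - name: jkl
      - name: mno
```

A table that needs a different SLA would just declare its own `freshness` block to override the inherited one.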
