How to upload CSV data with newlines using dbt seed
I have a 3rd-party-generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I managed to upload it manually to BigQuery, but I needed to enable "Quoted newlines", which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?
1 Answer
As far as I know, the seed feature is very limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for the expansion of seed options here on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files for the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar but much more robust file structuring. I know managing this can be a lot of overhead, but it becomes very much worth it if your seed files keep expanding in a way that can take advantage of partitioning, for instance.
And more importantly - as mentioned in this answer from Jeremy on Stack Overflow, dbt-external-tables lets you pass external table options through to BigQuery, which for your case should be either the quote or allowQuotedNewlines option. If you do choose to use dbt-external-tables, your source.yml for this (gcs.yml) would look something like the sketch below, or something very similar.
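A minimal sketch only - the bucket path, project, and source/table names below are placeholders, since the original gcs.yml was not preserved in this post. The idea is that dbt-external-tables renders the entries under external.options into the OPTIONS() clause of the CREATE EXTERNAL TABLE statement, where BigQuery spells the newline setting allow_quoted_newlines:

```yaml
# gcs.yml - hypothetical dbt-external-tables source for the ClickUp export
version: 2

sources:
  - name: clickup                          # placeholder source name
    database: my-gcp-project               # placeholder GCP project
    loader: gcloud storage
    tables:
      - name: task
        description: "ClickUp task export sideloaded from a GCS bucket"
        external:
          location: "gs://my-bucket/clickup/tasks_*.csv"   # placeholder bucket/path
          options:
            format: csv
            skip_leading_rows: 1
            quote: '"'
            allow_quoted_newlines: true    # accept newlines inside quoted CSV fields
```

Running dbt run-operation stage_external_sources should then (re)create the external table with those options applied.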
And if you end up taking this path and storing task data in a daily partition like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, as Jeremy also shared with me here: Retrieve "filename" from GCP storage during dbt-external-tables sideload? (https://stackoverflow.com/questions/66143582/retrieve-filename-from-gcp-storage-during-dbt-external-tables-sideload/66153656#66153656) - see the query sketch after this paragraph.
I find it to be a very powerful set of tools for working with files, specifically with BigQuery.
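For illustration, a downstream model could pull the file name in via BigQuery's _FILE_NAME pseudocolumn, which is available when querying GCS-backed external tables; the model name and source reference below are placeholders matching the hypothetical gcs.yml above:

```sql
-- models/staging/stg_clickup_tasks.sql (hypothetical model)
-- _FILE_NAME is a BigQuery pseudocolumn on GCS-backed external tables;
-- it is not returned by SELECT *, so it has to be selected explicitly.
select
    _FILE_NAME as source_file_name,
    *
from {{ source('clickup', 'task') }}
```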