How to upload CSV data containing newlines with dbt seed

Posted 2025-01-20 15:08:58


I have a 3rd party generated CSV file that I wish to upload to Google BigQuery using dbt seed.

I managed to upload it manually to BigQuery, but I had to enable "Quoted newlines", which is off by default.

When I run dbt seed, I get the following error:

16:34:43  Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43    Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.

There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.

Any ideas?
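The column-count error is exactly what a line-based reader produces when quoted fields contain newlines: each embedded newline is mistaken for a row boundary. A minimal illustration (using hypothetical two-column data, not the actual ClickUp export):

```python
import csv
import io

# A hypothetical two-column CSV where the second field contains a quoted
# newline, mimicking the structure of the ClickUp export described above.
raw = 'id,description\n1,"line one\nline two"\n2,"plain"\n'

# Naive newline splitting treats the embedded newline as a row boundary,
# so the quoted record is broken into two malformed fragments.
naive_rows = [line.split(",") for line in raw.strip().split("\n")]
print(len(naive_rows))  # 4 "rows": header + 3 fragments

# A quote-aware parser (like BigQuery with allow_quoted_newlines=true)
# keeps the quoted newline inside the field.
proper_rows = list(csv.reader(io.StringIO(raw)))
print(len(proper_rows))   # 3 rows: header + 2 records
print(proper_rows[1][1])  # 'line one\nline two'
```

BigQuery's loader behaves like the naive splitter unless "Quoted newlines" (allow_quoted_newlines) is enabled, which dbt seed does not expose.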


离不开的别离 2025-01-27 15:08:58


As far as I know, the seed feature is quite limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanding seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.


That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar but much more robust file structuring. Managing this adds overhead, I know, but it becomes well worth it if your seed files keep growing in a way that can take advantage of partitioning, for instance.

And more importantly, as Jeremy mentioned in this answer on Stack Overflow,

The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.

For your case, those would be the quote and allowQuotedNewlines options. If you do choose to use dbt-external-tables, your source.yml for this would look something like:

gcs.yml

version: 2

sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
  
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true

Or something very similar.
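For intuition, the DDL that dbt-external-tables ultimately issues against BigQuery looks roughly like the following. This is just a sketch that renders a statement from the options above; the package's actual macros handle quoting, partitions, and more, and their exact output may differ:

```python
# Sketch: render a CREATE EXTERNAL TABLE statement from the external
# options above. Illustrative only -- not the package's real macro output.
def render_external_table_ddl(table: str, location: str, options: dict) -> str:
    def fmt(value):
        # Booleans and integers are bare literals in BigQuery DDL;
        # strings are single-quoted.
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return str(value)
        return "'" + str(value).replace("'", "\\'") + "'"

    # The location becomes the uris list in the OPTIONS clause.
    opts = dict(options, uris=[location])
    rendered = []
    for key, value in opts.items():
        if isinstance(value, list):
            rendered.append(f"{key} = [{', '.join(fmt(v) for v in value)}]")
        else:
            rendered.append(f"{key} = {fmt(value)}")
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE {table}\n"
        "OPTIONS (\n  " + ",\n  ".join(rendered) + "\n)"
    )

ddl = render_external_table_ddl(
    "external_tables.task",
    "gs://bucket/clickup/task/*",
    {
        "format": "csv",
        "skip_leading_rows": 1,
        "quote": '"',
        "allow_quoted_newlines": True,
    },
)
print(ddl)
```

The key point is that allow_quoted_newlines lands in the table's OPTIONS, which is exactly the toggle that dbt seed cannot reach.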
And if you end up taking this path and storing task data in daily partitions like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, as Jeremy also shared with me here:

Retrieve "filename" from gcp storage during dbt-external-tables sideload?
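For example, with daily files named like tasks_2022_04_16.csv, the partition date can be recovered from the GCS URI that BigQuery's _FILE_NAME pseudocolumn exposes for external tables. A hypothetical helper mirroring what you would do with REGEXP_EXTRACT in BigQuery SQL:

```python
import re
from datetime import date

def partition_date_from_uri(uri: str) -> date:
    """Parse the partition date out of a GCS URI like
    gs://bucket/clickup/task/tasks_2022_04_16.csv -- the kind of value
    the _FILE_NAME pseudocolumn returns for external tables."""
    match = re.search(r"tasks_(\d{4})_(\d{2})_(\d{2})\.csv$", uri)
    if match is None:
        raise ValueError(f"unrecognized file name: {uri}")
    year, month, day = (int(g) for g in match.groups())
    return date(year, month, day)

print(partition_date_from_uri("gs://bucket/clickup/task/tasks_2022_04_16.csv"))
# 2022-04-16
```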

I find it to be a very powerful set of tools for working with files, specifically with BigQuery.
