Is there a way to provide a schema or auto-detect it when uploading a CSV from GCS to BigQuery?

Posted 2025-01-17 21:41:11


I am trying to load a CSV file from Google Cloud Storage (GCS) into BigQuery (BQ) with schema auto-detection.

What I tried was enabling schema auto-detection and entering the number of rows to skip in the "Header rows to skip" option. My file has 6 rows of descriptive information about the data that I need to skip; the 7th row is my actual header row.

According to Google's documentation at https://cloud.google.com/bigquery/docs/schema-detect#auto-detect:

"The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field."

The problem with my CSV is that the above condition is not met, in the sense that the rows contain nulls.

Also, my CSV contains many rows with no numerical values at all, which I think adds extra complexity for Google's schema auto-detection.

Auto-detection is not picking up the correct column names or the correct field types. All field types are detected as strings, and columns are named string_field_0, string_field_1, string_field_3, etc. It is also loading my CSV's header row as a row of data.

I would like to know how to load this CSV into BQ correctly, skipping the unwanted leading rows and getting the correct schema (field names and field types).


Comments (3)

感性 2025-01-24 21:41:12


After reading some of the documentation, specifically the CSV header section, I think what you're observing is the expected behavior.

An alternative would be to manually specify the schema for the data.

眼藏柔 2025-01-24 21:41:12


You can try using a tool like bigquery-schema-generator to generate the schema from your CSV file and then use it in a bq load job, for example.

儭儭莪哋寶赑 2025-01-24 21:41:12


Solved this by including my actual header row in the count of rows to skip.

I had 6 rows I actually needed to skip; the 7th row was my header (column names). I was entering 6 in "Header rows to skip".

When I entered 7 instead of 6, the schema was auto-detected correctly.

Also, I realized that in this sentence from Google's documentation, "The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field.", nulls are considered values, so they were not actually causing a problem with the load into BQ.

Hope this helps someone facing the same issue!
