Error loading a DataFrame into a BigQuery table (pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int)

Posted on 2025-01-08 23:39:49


I have a CSV file stored in GCS that I want to load into a BigQuery table. I need to do some pre-processing first, so I load it into a DataFrame and then load that into the BigQuery table:

import pandas as pd
import json 
from google.cloud import bigquery


cols_name_list = [...]  # column names, in order
uri = "gs://<bucket>/<path>/<csv_file>"
df = pd.read_csv(uri, dtype="string")
df = df.reindex(columns=cols_name_list)

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
   ... # all schema fields added according to the table's column types
)

job = client.load_table_from_dataframe(
    df, "<bq_table_id>", job_config=job_config
)
job.result()

In the code above, I reorder the DataFrame columns to match the column order of the BigQuery table (not sure whether this matters) and force every column to the string dtype.

I got the error shown below:

pyarrow.lib.ArrowInvalid: Could not convert '47803' with type str: tried to convert to int

I also ran it without forcing the dtypes to string, and got a different error:

pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64

The code and data look normal to me. I tried downgrading numpy and pyarrow, but the errors stayed the same.

Update:

I updated the code to force the string dtype only on column B:

df = pd.read_csv(uri, dtype={"B": "string"})

This is a sample of the CSV data I'm working with:

A,B,C,D,E,F,G
47803,000000000020030263,629,,2021-01-12 23:26:37,,

and the column types of the BigQuery table should look like this:

job_config = bigquery.LoadJobConfig(
    schema = [
        bigquery.SchemaField("A", "INTEGER"),
        bigquery.SchemaField("B", "STRING"),
        bigquery.SchemaField("C", "INTEGER"),
        bigquery.SchemaField("D", "INTEGER"),
        bigquery.SchemaField("E", "DATETIME"),
        bigquery.SchemaField("F", "INTEGER"),
        bigquery.SchemaField("G", "DATETIME")
    ]
)

Now, when I try to load the data with load_table_from_dataframe() and this config, I get this error:

pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int

So I printed out the dtypes:

A      int64
B      string
C      int64
D      float64
E      object
F      float64
G      float64
dtype: object

Which column is the problem now, and how can I fix this? The error is not very helpful for debugging: the columns that are supposed to be int are already int, and the only string column (B) should not need any conversion to int, yet that is the error being thrown.
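
To find the offending column, one option is to probe each column against the Arrow type implied by the BigQuery schema, roughly approximating the conversion the client performs before loading. This is only a sketch: the pandas-to-Arrow type mapping below is an assumption, not the client's exact code path.

import pyarrow as pa

# Assumed Arrow equivalents of the intended BigQuery column types
arrow_types = {
    "A": pa.int64(),
    "B": pa.string(),
    "C": pa.int64(),
    "D": pa.int64(),
    "E": pa.timestamp("us"),
    "F": pa.int64(),
    "G": pa.timestamp("us"),
}

# Convert each column individually and report the ones Arrow rejects
for col, typ in arrow_types.items():
    try:
        pa.array(df[col], type=typ)
    except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
        print(f"column {col!r}: {exc}")

Whichever column trips first, the dtypes printed above do not line up with the schema: D and F became float64 (empty CSV fields force NaN), E is object, and G is float64 even though it targets DATETIME. Below is a sketch that coerces the frame to schema-compatible dtypes before loading, using pandas' nullable Int64 so empty fields stay <NA> instead of turning the column into floats:

import pandas as pd

df = pd.read_csv(uri, dtype="string")   # uri as defined earlier

for col in ["A", "C", "D", "F"]:
    df[col] = df[col].astype("Int64")   # nullable integer dtype

for col in ["E", "G"]:
    df[col] = pd.to_datetime(df[col])   # empty fields become NaT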


Comments (4)

宁愿没拥抱 2025-01-15 23:39:49


You have to set source_format inside your LoadJobConfig to the format of the source data. In this case you can also set autodetect=False, since you have explicitly specified the table's schema. Below is a sample you can adapt to your code and try:

job_config = bigquery.LoadJobConfig(
 
    schema = [
        bigquery.SchemaField("A", "INTEGER"),
        bigquery.SchemaField("B", "STRING"),
        bigquery.SchemaField("C", "INTEGER"),
        bigquery.SchemaField("D", "INTEGER"),
        bigquery.SchemaField("E", "DATETIME"),
        bigquery.SchemaField("F", "INTEGER"),
        bigquery.SchemaField("G", "DATETIME")
    ],
    autodetect=False,
    source_format=bigquery.SourceFormat.CSV
)

CSV Data:

A,B,C,D,E,F,G
47803,000000000020030263,629,785,2021-01-12 23:26:37,986,2022-01-12 23:26:37

Output: (screenshot of the loaded table, not reproduced here)
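
For what it's worth, a plausible explanation (my reading, not confirmed here) for why this helps: with source_format set to CSV, the client serializes the DataFrame with to_csv and lets BigQuery parse the values against the schema server-side, so pyarrow's client-side conversion, the step that raises ArrowTypeError, never runs. A minimal end-to-end sketch with this config, reusing the placeholders from the question:

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
df = pd.read_csv("gs://<bucket>/<path>/<csv_file>", dtype={"B": "string"})

# job_config is the LoadJobConfig defined above
job = client.load_table_from_dataframe(df, "<bq_table_id>", job_config=job_config)
job.result()  # blocks until done; raises if the load job failed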

绻影浮沉 2025-01-15 23:39:49


I'm facing the same issue, but I'm trying to use pandas_gbq.to_gbq.

This is my code:

    # transform new_data into a pandas DataFrame; new_data is parsed JSON

    new_data = pd.DataFrame(new_data, columns=['FechaHora', 'Fecha', 'Hora', 'Entrada', 'Grupo1', 'HourlyCountOne', 'HourlyCountTwo', 'Variacion', 'sumToAforo', 'DiaSemana', 'Tipo', 'NDiaSemana'])
    print("new_data.dtypes: ", new_data.dtypes)

    if new_data['FechaHora'].dtype == 'object':
        new_data['FechaHora'] = pd.to_datetime(new_data['FechaHora'], format='%Y-%m-%dT%H:%M:%S')

    new_data['HourlyCountOne'] = pd.to_numeric(new_data['HourlyCountOne'], errors='coerce').fillna(0).astype(int)
    new_data['HourlyCountTwo'] = pd.to_numeric(new_data['HourlyCountTwo'], errors='coerce').fillna(0).astype(int)
    new_data['Variacion'] = pd.to_numeric(new_data['Variacion'], errors='coerce').fillna(0).astype(int)
    new_data['NDiaSemana'] = pd.to_numeric(new_data['NDiaSemana'], errors='coerce').fillna(0).astype(int)
    
    print("new_data.dtypes: ", new_data.dtypes)

    #print new_data row by row
    for index, row in new_data.iterrows():
        #print the row as a single string
        print(f"{index}={row.to_string()}")
    
    pandas_gbq.to_gbq(new_data, 'project_id.dataset.table_id', project_id='project_id', if_exists='append')
    
    print("Loaded {} rows into {}.".format(len(new_data), 'project-id.dataset.tableid'))
    return "Loaded {} rows into {}.".format(len(new_data), 'project-id.dataset.tableid')

When I deploy and run the function, I get this error:

pyarrow.lib.ArrowTypeError: object of type <class 'str'> cannot be converted to int

I'm unable to find which column or value is causing the issue.

This is the BigQuery table schema:

FechaHora:DATETIME,
Fecha:DATE,
Hora:TIME,
Entrada:STRING,
Grupo1:STRING,
HourlyCountOne:INTEGER,
HourlyCountTwo:INTEGER,
Variacion:INTEGER,
sumToAforo:BOOLEAN,
DiaSemana:STRING,
Tipo:STRING,
NDiaSemana:INTEGER

These are the printed dtypes of new_data:

FechaHora datetime64[ns]
Fecha object 
Hora object 
Entrada object 
Grupo1 object 
HourlyCountOne int64 
HourlyCountTwo int64 
Variacion int64 
sumToAforo bool 
DiaSemana object 
Tipo object 
NDiaSemana int64

This is the output when I print the first row of the pandas dataframe:

0=FechaHora 2023-09-27 04:00:00
Fecha 2023-09-27
Hora 04:00:00
Entrada Entrada Motos
Grupo1 Entrada Motos
HourlyCountOne 9
HourlyCountTwo 0
Variacion 9
sumToAforo False
DiaSemana Miércoles
Tipo Moto
NDiaSemana 4
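
A guess rather than a confirmed diagnosis: the remaining object columns are the usual suspects with Arrow. Fecha and Hora are plain strings while the table expects DATE and TIME, so a sketch converting them to datetime.date / datetime.time objects, which can be serialized for those column types:

import pandas as pd

# Fecha: 'YYYY-MM-DD' strings -> datetime.date objects (maps to DATE)
new_data['Fecha'] = pd.to_datetime(new_data['Fecha'], format='%Y-%m-%d').dt.date

# Hora: 'HH:MM:SS' strings -> datetime.time objects (maps to TIME)
new_data['Hora'] = pd.to_datetime(new_data['Hora'], format='%H:%M:%S').dt.time
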
桜花祭 2025-01-15 23:39:49


It may not be a solution to your specific issue, but I faced the same problem and realized that to_gbq() retrieves the original table schema. Deleting the previously created table apparently resolves this issue for any dtype.
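
If you take this route with the google-cloud-bigquery client, dropping the table is a single call; a minimal sketch (the table ID is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
# not_found_ok=True turns "table does not exist" into a no-op
client.delete_table("project.dataset.table", not_found_ok=True)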

怎言笑 2025-01-15 23:39:49


I had the same issue with load_table_from_dataframe. I guess there is something in the library worth checking, since the load only works when all three are passed (the schema, autodetect set to False, and source_format set to bigquery.SourceFormat.CSV). I didn't want to replicate the schema, since I want to maintain it in only one place, in my Terraform.

I changed my code as below to read the existing BigQuery schema and pass it back during the write:

# Drop index from the pandas df
write_df.reset_index(drop=True, inplace=True)

# Get the dataset and table ref
dataset_ref = client.dataset("DATASET")
table_ref = dataset_ref.table("TABLE")

# Set the job config
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.schema = client.get_table(table_ref).schema
job_config.autodetect = False
job_config.source_format = bigquery.SourceFormat.CSV

# Write to BQ
load_job = client.load_table_from_dataframe(
        write_df, table_ref,
        job_config=job_config)
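
A possible follow-up, not part of the original snippet: block on the job and verify the row count, using the client and table_ref names from above.

# Wait for the load to complete; result() raises if the job failed
load_job.result()

table = client.get_table(table_ref)
print("Loaded {} rows into {}".format(table.num_rows, table_ref))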