BigQuery table as an artifact in a Kubeflow pipeline
I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the different stages of the pipeline?
Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from which I will create multiple machine learning models. I would like to be able to pass these tables to the next stage so that I can create models from them.
Currently, what I am doing is just saving the URI into a pandas DataFrame:
from kfp.dsl import Dataset, Output, component

@component(packages_to_install=["pandas"])
def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    import pandas as pd  # lightweight components must import inside the function

    ## table name
    table_id = url + "_lag_" + str(lag)
    ## code to query and create the new table
    ##
    ##
    ## store the table ID in a DataFrame which can be passed to the next stage
    df = pd.DataFrame(data=[table_id], columns=["path"])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding="utf-8-sig")
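In the next stage I just read the table ID back out of that CSV. A minimal sketch of what that consumer looks like, assuming the KFP v2 SDK (make_the_model is a made-up name, just for illustration):

from kfp.dsl import Dataset, Input, component

@component(packages_to_install=["pandas"])
def make_the_model(dataset_uri: Input[Dataset]):
    import pandas as pd

    # Read back the table ID that get_the_data wrote out.
    df = pd.read_csv(dataset_uri.path + ".csv")
    table_id = df["path"].iloc[0]
    ## code to train a model against the BigQuery table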
Eventually I am going to use a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table IDs so that I can run subsequent ops on them.
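For concreteness, the fan-out half of that plan would look roughly like this, assuming the KFP v2 SDK (the pipeline name and lag values are placeholders); it is the fan-in, collecting every iteration's output for subsequent ops, that I can't figure out:

from kfp import dsl

@dsl.pipeline(name="lagged-tables")
def my_pipeline(project_id: str, url: str):
    # Fan out: one get_the_data task per lag value (placeholder lags).
    with dsl.ParallelFor(items=[1, 2, 3]) as lag:
        data_task = get_the_data(project_id=project_id, url=url, lag=lag)
        # data_task.outputs["dataset_uri"] is only visible here, inside
        # the loop body; I don't see how to gather all of them for a
        # single downstream op.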