BigQuery table as an artifact in a Kubeflow pipeline
I am running a custom component in Kubeflow to do some data manipulation and then save the result as a BigQuery table. How do I register the table as an artifact so that I can pass it down to the different stages of the pipeline?
Eventually I am planning on using a ParallelFor op to create multiple BigQuery tables, from which I will create multiple machine learning models. I would like to be able to pass these tables to the next stage so that I can create models from them.
Currently, what I am doing is just saving the URI into a pandas DataFrame:
from kfp.dsl import Dataset, Output, component

@component(packages_to_install=["pandas"])
def get_the_data(
    project_id: str,
    url: str,
    dataset_uri: Output[Dataset],
    lag: int = 0,
):
    import pandas as pd  # lightweight components must import inside the function

    ## table name
    table_id = url + "_lag_" + str(lag)
    ## code to query and create the new table
    ##
    ##
    ## store the table ID in a DataFrame which can be passed to the next stage
    df = pd.DataFrame(data=[table_id], columns=["path"])
    df.to_csv(dataset_uri.path + ".csv", index=False, encoding="utf-8-sig")
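In the next stage I just read the table ID back out of that CSV. A minimal sketch of what that consumer looks like, assuming the KFP v2 SDK (make_the_model is a made-up name, just for illustration):

from kfp.dsl import Dataset, Input, component

@component(packages_to_install=["pandas"])
def make_the_model(dataset_uri: Input[Dataset]):
    import pandas as pd

    # Read back the table ID that get_the_data wrote out.
    df = pd.read_csv(dataset_uri.path + ".csv")
    table_id = df["path"].iloc[0]
    ## code to train a model against the BigQuery table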
Eventually I am going to use a ParallelFor op to run this component multiple times in parallel and create multiple tables. I don't know how to manage and collect the table IDs so that I can run subsequent ops on them.
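For concreteness, the fan-out half of that plan would look roughly like this, assuming the KFP v2 SDK (the pipeline name and lag values are placeholders); it is the fan-in, collecting every iteration's output for subsequent ops, that I can't figure out:

from kfp import dsl

@dsl.pipeline(name="lagged-tables")
def my_pipeline(project_id: str, url: str):
    # Fan out: one get_the_data task per lag value (placeholder lags).
    with dsl.ParallelFor(items=[1, 2, 3]) as lag:
        data_task = get_the_data(project_id=project_id, url=url, lag=lag)
        # data_task.outputs["dataset_uri"] is only visible here, inside
        # the loop body; I don't see how to gather all of them for a
        # single downstream op.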