ETL huge data files from S3 to Snowflake

Posted on 2025-01-29 21:11:59

I am trying to build a pipeline that reads huge files from S3 and loads them into Snowflake after applying some transformation logic. The files in S3 are unstructured: their rows have different types and different numbers of columns.

e.g.
clmA|clmB|clmC
clm1|clm2

Currently I read the data with a Python script, split it into two pandas DataFrames (one with columns clmA|clmB|clmC and another with columns clm1|clm2), and insert them into Snowflake tables using the write_pandas method provided by snowflake.connector.pandas_tools.
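To make the splitting step concrete, here is a minimal sketch of how rows could be grouped by their number of fields. It uses only the standard library; the function name `split_by_column_count` is hypothetical, and the `|` separator is taken from the sample rows above. In my script each group is then turned into a pandas DataFrame and passed to write_pandas, which is not shown here.

```python
from collections import defaultdict


def split_by_column_count(lines, sep="|"):
    """Group delimited rows by how many fields they contain.

    Returns a dict mapping field count -> list of rows, where each row
    is a list of string fields. Blank lines are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        fields = line.split(sep)
        groups[len(fields)].append(fields)
    return dict(groups)


sample = ["clmA|clmB|clmC\n", "clm1|clm2\n"]
groups = split_by_column_count(sample)
three_col_rows = groups.get(3, [])  # rows destined for the 3-column table
two_col_rows = groups.get(2, [])    # rows destined for the 2-column table
```

Grouping by field count is only an assumption that fits the two-layout example; real files may need a stricter rule to tell record types apart.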

This works fine, but the problem is that if processing of a huge file fails, I have to read and process that file again from the start. I tried Kinesis Data Streams, which lets me read and process the data in chunks, but that still doesn't solve the problem.
AWS Glue provides job bookmarks, but how can I put the data into Glue using Python and then load it into Snowflake?
I am new to data engineering, so I would love to hear from the community what the recommended path is.
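
To illustrate the restart behaviour I am after, here is a rough sketch of chunked processing with a saved offset, so a rerun continues from the last completed chunk instead of from the start. Everything in it is an assumption for illustration: a local file stands in for the S3 object, `handle_chunk` is a hypothetical callback that would do the transformation and the Snowflake load, and the checkpoint is a small JSON file.

```python
import json
import os


def process_in_chunks(data_path, checkpoint_path, handle_chunk, chunk_lines=1000):
    """Process a file in chunks of lines, saving the byte offset after each chunk.

    If the checkpoint file exists, processing resumes from the saved offset.
    Returns the final byte offset reached.
    """
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as ck:
            offset = json.load(ck)["offset"]

    with open(data_path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = []
            for _ in range(chunk_lines):
                line = f.readline()
                if not line:
                    break
                chunk.append(line.decode("utf-8").rstrip("\r\n"))
            if not chunk:
                break
            handle_chunk(chunk)  # hypothetical: transform and load; may raise
            offset = f.tell()
            # Write the checkpoint to a temp file, then swap it into place.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as ck:
                json.dump({"offset": offset}, ck)
            os.replace(tmp_path, checkpoint_path)
    return offset
```

The offset is saved only after `handle_chunk` returns, so a crash between the load and the checkpoint write would replay that chunk on the next run; the load would have to tolerate duplicates. I do not know whether this is the recommended way to do it, which is part of what I am asking.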
