Unzip files from S3 before loading them into Snowflake
I have data available in an S3 bucket we don't own, with a zipped folder containing files for each date.
We are using Snowflake as our data warehouse. Snowflake accepts gzip'd files, but does not ingest zip'd folders.
Is there a way to directly ingest the files into Snowflake that will be more efficient than copying them all into our own S3 bucket and unzipping them there, then pointing e.g. Snowpipe to that bucket? The data is on the order of 10GB per day, so copying is very doable, but would introduce (potentially) unnecessary latency and cost. We also don't have access to their IAM policies, so can't do something like S3 Sync.
I would be happy to write something myself, or use a product/platform like Meltano or Airbyte, but I can't find a suitable solution.
2 Answers
How about using SnowSQL to load the data into Snowflake, using a Snowflake table, user, or named stage to hold the files?
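For illustration, a minimal sketch of that flow, driven from Python through the Snowflake connector rather than the SnowSQL CLI (the connection parameters, the stage name `daily_files`, the table `my_table`, the CSV file format, and the local path are all hypothetical placeholders), might look like this. It assumes you have already downloaded and unzipped the vendor's archive locally; PUT gzips the files on upload because AUTO_COMPRESS defaults to TRUE.

```python
import snowflake.connector

# Hypothetical connection parameters; replace with your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# Create a named internal stage to hold the files.
cur.execute("CREATE STAGE IF NOT EXISTS daily_files")

# PUT uploads local files to the stage; they are gzipped automatically
# (AUTO_COMPRESS = TRUE by default), so unzipping the archive locally
# is enough -- no separate gzip step is needed.
cur.execute("PUT file:///tmp/unzipped/*.csv @daily_files")

# Load the staged files into the target table (format is an assumption).
cur.execute("""
    COPY INTO my_table
    FROM @daily_files
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

conn.close()
```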
I had a similar use case. I use an event-based trigger that runs a Lambda function every time a new zipped file lands in my S3 folder. The Lambda function opens the zipped file, gzips each individual file inside it, and re-uploads them to a different S3 folder. Here's the full working code: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
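For reference, here is a minimal sketch of that Lambda pattern (this is not the code from the linked article; the destination bucket, prefix, and in-memory handling are assumptions, and large archives would need streaming to /tmp or multipart uploads instead):

```python
import gzip
import io
import urllib.parse
import zipfile

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket/prefix that Snowpipe would watch.
DEST_BUCKET = "my-unzipped-bucket"
DEST_PREFIX = "gzipped/"


def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events for new .zip files.

    Opens each zip archive, gzips every member, and uploads the
    results to DEST_BUCKET so Snowflake/Snowpipe can ingest them.
    """
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the archive into memory (assumption: archives fit in the
        # Lambda's memory; otherwise stream to /tmp).
        zip_bytes = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()

        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            for member in archive.namelist():
                if member.endswith("/"):
                    continue  # skip directory entries
                data = archive.read(member)

                # Gzip the member in memory and upload it to the destination.
                gz_buffer = io.BytesIO()
                with gzip.GzipFile(fileobj=gz_buffer, mode="wb") as gz:
                    gz.write(data)
                gz_buffer.seek(0)
                s3.upload_fileobj(
                    gz_buffer,
                    DEST_BUCKET,
                    f"{DEST_PREFIX}{member}.gz",
                )
```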