What is the fastest way to pull large datasets from a Snowflake database into AWS SageMaker?
What would be the fastest way to pull very large datasets from Snowflake into my SageMaker instance in AWS? How does the Snowflake Python connector (what I currently use) compare to, let's say, a Spark connector to Snowflake?

1 Answer
SageMaker training jobs use S3 as the input source, but you can also use EFS (NFS) or FSx for Lustre for higher performance.
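For illustration, here is a minimal sketch of pointing a training job at an S3 prefix with the SageMaker Python SDK. The bucket, image URI, and IAM role are placeholders, not values from the question:

```python
# Sketch: launch a SageMaker training job that reads its data from S3.
# Image URI, role ARN, and bucket below are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

estimator = Estimator(
    image_uri="<training-image-uri>",                      # your algorithm's container
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role
    instance_count=2,                                      # >1 for distributed training
    instance_type="ml.m5.2xlarge",
    sagemaker_session=session,
)

# With ShardedByS3Key, each training instance receives a subset of the S3 objects,
# which is where partitioned exports (next point) pay off.
train_input = TrainingInput(
    s3_data="s3://my-bucket/snowflake-export/train/",
    distribution="ShardedByS3Key",
)

estimator.fit({"train": train_input})
```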
For S3, I'd use AWS Glue to read from Snowflake or use Spark on EMR, and store the data in partitions in S3. Partitioning would allow you to distribute your training across multiple machines, if your algorithm supports it
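As an illustration of the Spark route, a rough sketch using the Snowflake Spark connector on EMR to pull a table and write it back out as partitioned Parquet on S3. Connection parameters, table, and partition column are placeholders:

```python
# Sketch: read a large Snowflake table with the Spark-Snowflake connector
# and write it to S3 as partitioned Parquet. All names/credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-to-s3").getOrCreate()

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df = (
    spark.read.format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "BIG_TABLE")   # or .option("query", "...") to push down SQL
    .load()
)

# Partition by a column your training can shard on (e.g. a date column); each
# partition becomes a separate S3 prefix that SageMaker can spread across instances.
(
    df.write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://my-bucket/snowflake-export/train/")
)
```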
There's also COPY INTO in Snowflake, which can unload query results or a table directly to S3. Ideally, you'd store in Parquet format, but [gzipped] CSV is the common format for SageMaker built-in algorithms. If you're using your own algorithm, then probably go with Parquet.
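As a sketch of that unload path, driven from the Python connector the question already uses: the account details, storage integration, bucket, and table name below are all placeholders you would replace with your own.

```python
# Sketch: unload a Snowflake table to S3 as Parquet using COPY INTO <location>.
# Account details, storage integration, bucket, and table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount",
    user="my_user",
    password="my_password",
    warehouse="MY_WH",
    database="MY_DB",
    schema="PUBLIC",
)

unload_sql = """
COPY INTO 's3://my-bucket/snowflake-export/train/'
FROM BIG_TABLE
STORAGE_INTEGRATION = my_s3_integration  -- pre-created integration granting S3 access
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE                            -- keep column names in the Parquet files
MAX_FILE_SIZE = 268435456                -- ~256 MB files, reasonable training shards
"""

try:
    conn.cursor().execute(unload_sql)
finally:
    conn.close()
```

This skips the Spark/Glue hop entirely and lets the Snowflake warehouse do the export, which is often the least-moving-parts option when you just need files in S3.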
If you're doing forecasting, you could also use Amazon Forecast, but it can get pricey