How to import a very large CSV into DynamoDB?
So I have a very large CSV file (2 million+ lines) in my S3 bucket and I want to import it into DynamoDB.
What I tried:
Lambda
I managed to get the Lambda function to work, but only around 120k lines had been imported into DDB by the time my function timed out.
Pipeline
When using the pipeline, it got stuck on "waiting for runner" and then stopped completely.
2 Answers
Here's a serverless approach to process the large .csv in small chunks with 2 Lambdas and an SQS queue:

The first Lambda pulls the primary keys out of the .csv with S3 Select:

SELECT s.primary_key FROM S3Object s

querying the .csv in place (see the SelectObjectContent API for details), and sends them to the SQS queue in small batches.

The second Lambda, triggered by the queue, fetches the records for its batch of primary keys from the .csv, again using S3 Select:

SELECT * FROM S3Object s WHERE s.primary_key IN ('id1', 'id2', 'id3')

and writes them to DynamoDB.
You could set up external EMR tables (or maybe Athena, so you wouldn't need an EMR cluster), one for the S3 files and one for the DynamoDB table, using the DynamoDBStorageHandler connector. It supports copying data from DynamoDB to S3 and also from S3 to DynamoDB just by running INSERTs and SELECTs between the tables.
An example of setting up an external S3 file table would be:
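Something like this minimal sketch, where the bucket location and the columns id, sort_key and payload are placeholders for the actual .csv layout:

-- External table over the raw .csv sitting in S3
-- (skip.header.line.count drops the header row if the file has one)
CREATE EXTERNAL TABLE s3_csv_import (
  id       STRING,
  sort_key STRING,
  payload  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/path/to/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');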
And to set up the DynamoDB external table:
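A sketch along the same lines, assuming the target table's partition and sort keys are called pk and sk (placeholders again, as is the table name):

-- External table backed by the DynamoDB table via the EMR storage handler;
-- dynamodb.column.mapping maps each Hive column to a DynamoDB attribute
CREATE EXTERNAL TABLE ddb_target (
  id       STRING,
  sort_key STRING,
  payload  STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  'dynamodb.table.name'     = 'YourDynamoDBTable',
  'dynamodb.column.mapping' = 'id:pk,sort_key:sk,payload:payload'
);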
And then to copy from S3 to DynamoDB:
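With the two placeholder tables above, the copy is a single Hive query:

-- Reads every row from the S3-backed table and writes it into DynamoDB
INSERT OVERWRITE TABLE ddb_target
SELECT id, sort_key, payload
FROM s3_csv_import;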
The OVERWRITE makes it so that any conflicting records in DynamoDB (with the same PK and SK) get overwritten by the new data being inserted.