How can I quickly insert massive amounts of data into the Datastore when running on the GAE development server?
Background:
While coding against GAE's local development web server, I need to upload megabyte-scale data and store it in the Datastore using the deferred library (not a straightforward store; each record needs format checking and translation).
A typical upload is about 50,000 entities from a CSV file of roughly 5 MB, and I insert 200 entities per deferred task (sketched below).
I'm using Python.
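For reference, here is a minimal sketch of that batching setup. The Record model and the parse_row helper are hypothetical stand-ins for the real entity kind and the format-check/translate step:

```python
from google.appengine.ext import db, deferred

BATCH_SIZE = 200  # entities per deferred task, as described above

class Record(db.Model):
    # Hypothetical model standing in for the real entity kind.
    name = db.StringProperty()
    value = db.IntegerProperty()

def parse_row(row):
    # Hypothetical stand-in for the "format check and translate" step.
    name, value = row.split(',')
    return Record(name=name.strip(), value=int(value))

def insert_batch(rows):
    # One batched db.put() per 200 rows instead of 200 single puts.
    db.put([parse_row(r) for r in rows])

def enqueue_csv(lines):
    # Fan the CSV lines out into deferred tasks of BATCH_SIZE rows each.
    for i in range(0, len(lines), BATCH_SIZE):
        deferred.defer(insert_batch, lines[i:i + BATCH_SIZE])
```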
Problem:
The development server is so slow that I have to wait an hour or more for this upload process to finish.
I already use the --use_sqlite option to speed up the development web server.
Question:
Is there any other method or tuning option that can make this faster?
appengine-mapreduce is definitely an option for loading CSV files. Upload the CSV file to the blobstore, then set up a mapper with the BlobstoreLineInputReader input reader to load the data into the datastore. Some more links: the Python guide to the mapreduce reader types is here; the reader of interest is BlobstoreLineInputReader, and the only input it requires is the key of the blobstore record containing the uploaded CSV file.
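A hedged sketch of what that looks like, assuming the appengine-mapreduce library is vendored into the app; csv_row_mapper, start_import, the myapp.loaders module path, and the Record model are illustrative names, not part of the library:

```python
from google.appengine.ext import db
from mapreduce import control
from mapreduce import operation as op

class Record(db.Model):
    # Hypothetical entity kind, matching the earlier sketch.
    name = db.StringProperty()
    value = db.IntegerProperty()

def csv_row_mapper(data):
    # BlobstoreLineInputReader yields (byte_offset, line) tuples,
    # one per line of the uploaded CSV file.
    byte_offset, line = data
    name, value = line.split(',')  # the check/translate step goes here
    yield op.db.Put(Record(name=name.strip(), value=int(value)))

def start_import(blob_key):
    # blob_key: key of the blobstore record holding the uploaded CSV.
    return control.start_map(
        name='csv-import',
        handler_spec='myapp.loaders.csv_row_mapper',
        reader_spec='mapreduce.input_readers.BlobstoreLineInputReader',
        mapper_parameters={'blob_keys': [str(blob_key)]},
        shard_count=8)
```

The framework shards the input across the given shard count and pools the yielded datastore puts into batched writes, which is where the speedup over a single sequential task chain comes from.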