Inserting thousands of entities into BigTable in a reasonable amount of time

Posted 2024-11-15 12:42:22


I'm having some issues when I try to insert 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:

import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery

def add_cities():
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'),
                            delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        # One GQL query per row to resolve the Region key.
        region = GqlQuery("SELECT __key__ FROM Region WHERE code = :1",
                          i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(),
                                name=i[11].decode("utf-8"),
                                name_f=strip_accents(i[11].decode("utf-8")).lower()))
    db.put(mylist)

It's taking around 5 minutes (!!!) to do it with the local dev server, and even 10 when deleting them with the db.delete() function.
When I try it online by calling a test.py page containing add_cities(), the 30s timeout is reached.
I'm coming from the MySQL world, and I think it's a real shame not to be able to add 36k entities in less than a second. I may be doing it the wrong way, so I'm turning to you:

  • Why is it so slow?
  • Is there any way to do it in a reasonable time?

Thanks :)

Comments (3)

梦醒灬来后我 2024-11-22 12:42:22


First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.

The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:

  • Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case (see the sketch after this list).
  • Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
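
For illustration, here is a minimal sketch of the first suggestion (Python 2, google.appengine.ext.db). It assumes Region entities were saved with their code as key_name, and that InseeCity and strip_accents come from the question's databaseModel module:

import csv
from google.appengine.ext import db
from databaseModel import InseeCity, strip_accents  # assumed to live in the question's module

def add_cities():
    reader = csv.reader(open('datas/cities_utf8.txt', 'rb'),
                        delimiter='\t', quotechar='|')
    batch = []
    for row in reader:
        code = row[2].decode('utf-8')
        name = row[11].decode('utf-8')
        # Build the Region key directly from its code: no query, no fetch.
        region_key = db.Key.from_path('Region', code)
        batch.append(InseeCity(region=region_key,
                               name=name,
                               name_f=strip_accents(name).lower()))
        if len(batch) >= 500:  # flush periodically to stay within batch limits
            db.put(batch)
            batch = []
    if batch:
        db.put(batch)

With the key built locally, the only datastore round-trips left are the batched put() calls.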

In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
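
For reference, a hedged sketch of what a mapper for the appengine-mapreduce library might look like, assuming the CSV has been uploaded as a blobstore blob, BlobstoreLineInputReader is configured as the input reader in mapreduce.yaml, and Region entities are keyed by their code; the column indices mirror the question:

from google.appengine.ext import db
from mapreduce import operation as op
from databaseModel import InseeCity, strip_accents  # assumed, as in the question

def process_city_line(data):
    # BlobstoreLineInputReader hands the mapper (byte_offset, line) tuples.
    byte_offset, line = data
    fields = line.rstrip('\n').split('\t')
    code = fields[2].decode('utf-8')
    name = fields[11].decode('utf-8')
    region_key = db.Key.from_path('Region', code)  # assumes Region keyed by its code
    # Yielding a Put operation lets the framework batch the writes.
    yield op.db.Put(InseeCity(region=region_key,
                              name=name,
                              name_f=strip_accents(name).lower()))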

活泼老夫 2024-11-22 12:42:22


Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 rows, using an offset value.
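
As an illustration only, here is a rough sketch of that approach. It assumes a worker handler mapped to a hypothetical /tasks/add_cities URL, Region entities keyed by their code, and the CSV path and column indices from the question:

import csv
import itertools
from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp
from databaseModel import InseeCity, strip_accents  # assumed, as in the question

BATCH = 500
CSV_PATH = 'datas/cities_utf8.txt'

class EnqueueHandler(webapp.RequestHandler):
    def get(self):
        total = sum(1 for _ in open(CSV_PATH, 'rb'))
        # One task per slice of 500 rows; each task gets its own request deadline.
        for offset in range(0, total, BATCH):
            taskqueue.add(url='/tasks/add_cities',
                          params={'offset': offset, 'limit': BATCH})

class AddCitiesWorker(webapp.RequestHandler):
    def post(self):
        offset = int(self.request.get('offset'))
        limit = int(self.request.get('limit'))
        reader = csv.reader(open(CSV_PATH, 'rb'), delimiter='\t', quotechar='|')
        rows = itertools.islice(reader, offset, offset + limit)
        cities = [InseeCity(region=db.Key.from_path('Region', r[2].decode('utf-8')),
                            name=r[11].decode('utf-8'),
                            name_f=strip_accents(r[11].decode('utf-8')).lower())
                  for r in rows]
        db.put(cities)  # one batched write per task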

半葬歌 2024-11-22 12:42:22


FWIW we process large CSVs into the datastore using mapreduce, with some initial handling/validation inside a task. Even tasks have a limit (10 minutes) at the moment, but that's probably fine for your data size.

Make sure that if you're doing inserts, etc., you batch as much as possible - don't insert individual records, and the same goes for lookups - get_by_key_name allows you to pass in an array of key names. (I believe db.put has a limit of 200 records at the moment?)
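
For example, a minimal illustration of both points, assuming the Region model from the question uses its code as key_name (the codes shown are made up):

from google.appengine.ext import db
from databaseModel import Region  # model from the question

def batched_put(entities, chunk=200):
    # Write in chunks to stay under the datastore's per-call batch limit.
    for i in range(0, len(entities), chunk):
        db.put(entities[i:i + chunk])

# One batched lookup instead of one query per row:
codes = ['01', '02', '2A']
regions = Region.get_by_key_name(codes)  # returns a list, with None for misses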

Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around - it's a must-have for larger data sets.

Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!
