Performing bulk db.delete on App Engine without burning CPU

Published 2024-10-07 20:34:47

We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.

One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:

from google.appengine.ext import db  # classic datastore API

query = ParentKind.all()
query.filter('bar =', 'foo')
to_delete = []
for entity in query.fetch(100):  # at most 100 parents per batch
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)

We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 child ChildKindA and ChildKindB entities total - perhaps 4000 entities.

This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.

This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
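As a sanity check, the quoted ~0.5 billable seconds per entity follows directly from the batch arithmetic above (assuming the ~40-children-per-parent estimate):

```python
parents = 100                # ParentKind entities per batch
children_per_parent = 40     # ChildKindA + ChildKindB combined (estimate)
total_entities = parents * (1 + children_per_parent)

billable_seconds = 1933
print(total_entities)                                      # 4100
print(round(billable_seconds / float(total_entities), 2))  # ~0.47 s per entity
```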

Update

We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.

Next, we installed Appstats to our app (thanks, kevpie) and ran a smaller sized batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:

from google.appengine.ext import db  # classic datastore API

query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')
to_delete = []
for key in query.fetch(10):  # 10 parent keys per batch this time
    to_delete.append(key)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))
db.delete(to_delete)

The results from Appstats:

service.call           #RPCs  real time  api time
datastore_v3.RunQuery  22     352ms      555ms
datastore_v3.Delete    1      366ms      132825ms
taskqueue.BulkAdd      1      7ms        0ms

The Delete call is the single most expensive part of the operation!
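Dividing that Delete api time across the batch (using the ~450-entity estimate above) shows where the per-entity cost goes:

```python
delete_api_ms = 132825   # api time Appstats reported for the single Delete RPC
entities = 450           # ~10 parents plus their children (estimate from above)
per_entity_ms = delete_api_ms / float(entities)
print(int(round(per_entity_ms)))   # ~295 ms of api time per entity deleted
```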

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.
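For reference, one common pattern for a deferred purge task is to collect keys, split them into slices that fit a single db.delete call (the classic db API caps batch operations at 500 entities), delete slice by slice, then re-enqueue with a query cursor. A minimal sketch of the chunking part — the `chunks` helper is our own; the deferred/cursor steps are shown only as comments:

```python
def chunks(keys, size=500):
    """Yield slices of `keys` no larger than `size`,
    the per-call batch limit of the classic db API."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Inside a deferred task one would then do something like (sketch):
#   for batch in chunks(to_delete):
#       db.delete(batch)
#   deferred.defer(purge_task, query.cursor())  # re-enqueue for the next slice

print([len(c) for c in chunks(list(range(1200)))])  # [500, 500, 200]
```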

Comments (2)

风情万种。 2024-10-14 20:34:47

We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.

关于从前 2024-10-14 20:34:47

If you want to spread out the CPU burn, you could create a map reduce job. It will still iterate over every entity (this is a current limitation of the mapper API). However, you can check if each entity meets the condition and delete or not at that time.

To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.
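The per-entity decision such a mapper would make can be sketched as a plain predicate. `FakeEntity` is a stand-in for illustration only; in the real appengine-mapreduce library the mapper function would yield a delete operation for matching entities rather than return a flag:

```python
class FakeEntity(object):
    """Stand-in for a datastore model with the `bar` property
    used in the question's filter (illustration only)."""
    def __init__(self, bar):
        self.bar = bar

def process(entity):
    # Mapper body: flag for deletion only entities matching the
    # original `bar = 'foo'` filter; leave everything else alone.
    return entity.bar == 'foo'

print(process(FakeEntity('foo')), process(FakeEntity('baz')))  # True False
```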
