Performing bulk db.delete on App Engine without burning CPU

Published 2024-10-07 20:34:47

We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.

One complication is that our entities also have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:

from google.appengine.ext import db  # classic datastore API

query = ParentKind.all()
query.filter('bar =', 'foo')
to_delete = []
for entity in query.fetch(100):  # at most 100 parents per batch
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))
db.delete(to_delete)

We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 child ChildKindA and ChildKindB entities total - perhaps 4000 entities.

This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.

This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
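As a sanity check, the quoted ~0.5 billable seconds per entity follows directly from the batch arithmetic above (assuming the ~40-children-per-parent estimate):

```python
parents = 100                # ParentKind entities per batch
children_per_parent = 40     # ChildKindA + ChildKindB combined (estimate)
total_entities = parents * (1 + children_per_parent)

billable_seconds = 1933
print(total_entities)                                      # 4100
print(round(billable_seconds / float(total_entities), 2))  # ~0.47 s per entity
```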

Update

We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.

Next, we installed Appstats to our app (thanks, kevpie) and ran a smaller sized batch -- 10 parent entities, which would amount to ~450 entities total. Here's the updated code:

from google.appengine.ext import db  # classic datastore API

query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')
to_delete = []
for key in query.fetch(10):  # 10 parent keys per batch this time
    to_delete.append(key)
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))
db.delete(to_delete)

The results from Appstats:

service.call           #RPCs  real time  api time
datastore_v3.RunQuery  22     352ms      555ms
datastore_v3.Delete    1      366ms      132825ms
taskqueue.BulkAdd      1      7ms        0ms

The Delete call is the single most expensive part of the operation!
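Dividing that Delete api time across the batch (using the ~450-entity estimate above) shows where the per-entity cost goes:

```python
delete_api_ms = 132825   # api time Appstats reported for the single Delete RPC
entities = 450           # ~10 parents plus their children (estimate from above)
per_entity_ms = delete_api_ms / float(entities)
print(int(round(per_entity_ms)))   # ~295 ms of api time per entity deleted
```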

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind, just the ones that match, and are children of, our initial bar = foo query.
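For reference, one common pattern for a deferred purge task is to collect keys, split them into slices that fit a single db.delete call (the classic db API caps batch operations at 500 entities), delete slice by slice, then re-enqueue with a query cursor. A minimal sketch of the chunking part — the `chunks` helper is our own; the deferred/cursor steps are shown only as comments:

```python
def chunks(keys, size=500):
    """Yield slices of `keys` no larger than `size`,
    the per-call batch limit of the classic db API."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

# Inside a deferred task one would then do something like (sketch):
#   for batch in chunks(to_delete):
#       db.delete(batch)
#   deferred.defer(purge_task, query.cursor())  # re-enqueue for the next slice

print([len(c) for c in chunks(list(range(1200)))])  # [500, 500, 200]
```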

Comments (2)

风情万种。 2024-10-14 20:34:47

We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.

关于从前 2024-10-14 20:34:47

If you want to spread out the CPU burn, you could create a map reduce job. It will still iterate over every entity (this is a current limitation of the mapper API). However, you can check if each entity meets the condition and delete or not at that time.

To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.
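The per-entity decision such a mapper would make can be sketched as a plain predicate. `FakeEntity` is a stand-in for illustration only; in the real appengine-mapreduce library the mapper function would yield a delete operation for matching entities rather than return a flag:

```python
class FakeEntity(object):
    """Stand-in for a datastore model with the `bar` property
    used in the question's filter (illustration only)."""
    def __init__(self, bar):
        self.bar = bar

def process(entity):
    # Mapper body: flag for deletion only entities matching the
    # original `bar = 'foo'` filter; leave everything else alone.
    return entity.bar == 'foo'

print(process(FakeEntity('foo')), process(FakeEntity('baz')))  # True False
```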
