How to delete entities not found in a feed on GAE
I am updating and adding items from a feed (which can have about 40,000 items) to the datastore, 200 items at a time. The problem is that the feed can change, and some items might be deleted from it.
I have this code:
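    from google.appengine.ext import db

    class FeedEntry(db.Model):
        name = db.StringProperty(required=True)

    def updateFeed(offset, number=200):
        # fetchFeed and parseFeed are the asker's own helpers (not shown).
        response = fetchFeed(offset, number)
        feedItems = parseFeed(response)
        feedEntriesToAdd = []
        for item in feedItems:
            # Using item.id as key_name means a repeated put overwrites
            # the existing entity instead of creating a duplicate.
            feedEntriesToAdd.append(
                FeedEntry(key_name=item.id, name=item.name)
            )
        db.put(feedEntriesToAdd)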
How do I find out which items are no longer in the feed and delete them from the datastore?
I thought about building a list of all the items in the datastore, removing from it every item I updated, and deleting whatever is left, but that seems rather slow.
PS: Every item.id is unique to its feed item and stays consistent between fetches.
If you add a DateTimeProperty with auto_now=True, it will record the last-modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
Xavier's generation counter is just as good: all we need is something guaranteed to increase between refreshes, and never decrease during a refresh.
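To make the timestamp idea concrete, here is a minimal sketch that models it with a plain dict instead of the real datastore, so it runs anywhere. The names `store`, `put_entries`, and `delete_stale` are hypothetical, and the `updated` field stands in for a DateTimeProperty with auto_now=True.

```python
import datetime

# In-memory stand-in for the datastore: key_name -> {"name", "updated"}.
store = {}

def put_entries(items, now):
    """Upsert feed items, stamping each with the write time.

    On App Engine, a DateTimeProperty(auto_now=True) would set this
    timestamp automatically on every put.
    """
    for item_id, name in items:
        store[item_id] = {"name": name, "updated": now}

def delete_stale(refresh_started):
    """Delete entries last written before the refresh began; return their keys."""
    stale = sorted(k for k, v in store.items() if v["updated"] < refresh_started)
    for k in stale:
        del store[k]
    return stale

# A previous refresh wrote a, b and c.
t0 = datetime.datetime(2010, 1, 1, 12, 0, 0)
put_entries([("a", "Alpha"), ("b", "Beta"), ("c", "Gamma")], t0)

# The next refresh starts an hour later; "b" has disappeared from the feed.
refresh_started = t0 + datetime.timedelta(hours=1)
put_entries([("a", "Alpha"), ("c", "Gamma v2")],
            refresh_started + datetime.timedelta(seconds=5))

# Anything not touched since the refresh started is no longer in the feed.
removed = delete_stale(refresh_started)
```

With the old `db` API, the sweep would presumably be a keys-only query such as `FeedEntry.all(keys_only=True).filter('updated <', refresh_started)`, with the resulting keys passed to `db.delete` in batches; the property name `updated` is an assumption, not something from the question.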
Not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64-bit integer, so they might be the same size, or it may be that DateTimeProperty stores several integers. A group post suggests it may be 10 bytes as opposed to 8.
But remember that by adding an extra property that you query on, you're adding another index anyway, so the difference in field size is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/GB/month.
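As a rough back-of-the-envelope check of that claim, the sketch below prices the extra property alone. It assumes 10 bytes per entity (the guess mentioned above), decimal gigabytes, and it deliberately ignores index overhead, whose size is not known here.

```python
ENTITIES = 40000              # feed size quoted in the question
EXTRA_BYTES_PER_ENTITY = 10   # assumed size of the added property
PRICE_PER_GB_MONTH = 0.24     # storage price quoted above, in USD

def monthly_cost(entities, bytes_per_entity, price_per_gb):
    """Monthly storage cost in USD for one extra property (index overhead excluded)."""
    total_gb = entities * bytes_per_entity / 1e9
    return total_gb * price_per_gb

cost = monthly_cost(ENTITIES, EXTRA_BYTES_PER_ENTITY, PRICE_PER_GB_MONTH)
print("%.6f USD per month" % cost)
```

Under these assumptions the property itself costs roughly a hundredth of a cent per month, so the choice between a datetime and an integer is unlikely to matter on cost grounds.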
With either a generation or a datetime, you don't necessarily have to delete the data immediately. Your other queries could filter on the date/generation of the most recent refresh instead. If the feed (or your parsing of it) goes funny and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Whether that's worth having depends entirely on the app.
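One way to read the "keep the last refresh as a backup" idea is sketched below, again with a dict in place of the datastore. Queries only see entries at or after a "live" generation, and a refresh is promoted to live only if it kept enough items. The helper names and the `min_ratio` threshold are hypothetical choices for illustration, not part of the answer.

```python
def visible_entries(store, live_generation):
    """Keys that queries should see: entries written at or after the live generation."""
    return sorted(k for k, v in store.items() if v["generation"] >= live_generation)

def count_since(store, generation):
    """Number of entries written at or after the given generation."""
    return sum(1 for v in store.values() if v["generation"] >= generation)

def promote_if_sane(store, live_generation, new_generation, min_ratio=0.5):
    """Switch to new_generation only if the refresh kept enough items.

    min_ratio is an arbitrary illustrative threshold, not a recommendation.
    """
    before = count_since(store, live_generation)
    after = count_since(store, new_generation)
    if after >= before * min_ratio:
        return new_generation
    return live_generation

# Generation 1: a healthy refresh stored four items.
store = dict((k, {"generation": 1}) for k in ("a", "b", "c", "d"))
live = 1

# Generation 2: the feed misbehaves and yields only "a".
store["a"]["generation"] = 2
live = promote_if_sane(store, live, 2)
live_after_bad_refresh = live
after_bad_refresh = visible_entries(store, live)

# Generation 3: the feed recovers, minus "b".
for k in ("a", "c", "d"):
    store[k]["generation"] = 3
live = promote_if_sane(store, live, 3)
after_good_refresh = visible_entries(store, live)
```

After the bad refresh all four items stay visible because the live generation is not advanced; after the good one, "b" drops out of query results and can be deleted whenever convenient.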
I would add a generation counter
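The answer gives no further detail, so the following is only one possible reading of a generation counter, modelled in memory so it can run outside App Engine. `FeedStore` and its methods are hypothetical; in a real app the counter would presumably live in its own entity and each `FeedEntry` would carry an integer generation property.

```python
class FeedStore(object):
    """In-memory model of a feed table with a generation counter."""

    def __init__(self):
        self.generation = 0
        self.entries = {}  # key_name -> (name, generation)

    def begin_refresh(self):
        """Bump the counter once, before the first batch of a refresh."""
        self.generation += 1
        return self.generation

    def put_batch(self, items, generation):
        """Upsert one batch, stamping every entry with the refresh's generation."""
        for item_id, name in items:
            self.entries[item_id] = (name, generation)

    def sweep(self, generation):
        """Delete entries left over from earlier generations; return their keys."""
        stale = sorted(k for k, (_, g) in self.entries.items() if g < generation)
        for k in stale:
            del self.entries[k]
        return stale

feed_store = FeedStore()

# First refresh: three items, written in batches (2 here, 200 in the question).
gen = feed_store.begin_refresh()
feed_store.put_batch([("a", "Alpha"), ("b", "Beta")], gen)
feed_store.put_batch([("c", "Gamma")], gen)

# Second refresh: "b" is gone from the feed.
gen = feed_store.begin_refresh()
feed_store.put_batch([("a", "Alpha"), ("c", "Gamma")], gen)

# Run the sweep only after the last batch of the refresh has been written.
removed = feed_store.sweep(gen)
```

The counter only needs to increase between refreshes and stay fixed while the batches of one refresh are being written, which is why it is bumped once up front rather than per batch.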