How to summarise data in Google AppEngine

Posted 2024-10-27 03:57:33

I'm trying to implement a summary view onto a large(ish) data set using AppEngine.

My model looks something like:

from google.appengine.ext import db

class TxRecord(db.Model):
    expense_type = db.StringProperty()
    amount = db.IntegerProperty()

class ExpenseType(db.Model):
    name = db.StringProperty()
    total = db.IntegerProperty()

My datastore contains 100K instances of TxRecord and I'd like to summarise these by expense_type.

In sql it would be something like:

select expense_type as name, sum(amount) as total 
    from TxRecord
    group by expense_type
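
For reference, the result that the query describes can be sketched in plain Python over an in-memory list. This is only an illustration of the intended output, not something that scales to the datastore; the `summarise` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

def summarise(records):
    """Sum amounts per expense type.

    `records` is an iterable of (expense_type, amount) pairs.
    """
    totals = defaultdict(int)
    for expense_type, amount in records:
        totals[expense_type] += amount
    return dict(totals)

# Hypothetical sample rows, for illustration only.
sample = [("travel", 120), ("food", 30), ("travel", 80), ("food", 15)]
totals = summarise(sample)
```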

What I'm currently doing is using the Python MapReduce framework to iterate over all of the TxRecords using the following mapper:

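from mapreduce import operation as op

def generate_expense_type(rec):
    # Key the summary entity by the expense type's name.
    expense_type = ExpenseType.get_or_insert(
        rec.expense_type, name=rec.expense_type, total=0)
    # Read-modify-write outside any transaction: this is the step at risk.
    expense_type.total += rec.amount

    yield op.db.Put(expense_type)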

This seems to work, but I feel I have to run it with a shard_count of 1 to ensure that the total isn't overwritten by concurrent writes.

Is there a strategy I can use to overcome this issue on AppEngine, or is that it?

两人的回忆 2024-11-03 03:57:33

Using mapreduce is the right approach. Counters, as David suggests, are one option, but they're not reliable (they use memcache), and they're not designed for massive numbers of counters to be kept in parallel.

Your current mapreduce has a couple of issues: First, get_or_insert executes a datastore transaction every time it's called. Second, you then update the amount outside the transaction and asynchronously store it a second time, generating the concurrency issue you were concerned about.
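
The lost update described here can be reproduced without App Engine at all. The following is a minimal sketch that uses a plain dict as a stand-in for the datastore and interleaves two shards by hand; the `read` and `write` helpers and the amounts are hypothetical.

```python
# In-memory stand-in for the datastore: entity name -> stored total.
store = {"travel": 0}

def read(name):
    return store[name]

def write(name, value):
    store[name] = value

# Two shards each read the current total before either has written.
seen_by_shard_a = read("travel")
seen_by_shard_b = read("travel")

# Each adds its own record's amount to the stale value and writes it back.
write("travel", seen_by_shard_a + 100)
write("travel", seen_by_shard_b + 50)

# Shard B's write replaced shard A's, so the 100 is gone.
lost_update_total = store["travel"]
```

The correct total would be 150, but the second write is based on a value read before the first write landed.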

At least until reduce is fully supported, your best option is to do the whole update in the mapper in a transaction, like this:

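def generate_expense_type(rec):
    name = rec.expense_type

    def _tx():
        # Read, update, and write the summary entity inside one transaction.
        expense_type = ExpenseType.get_by_key_name(name)
        if not expense_type:
            expense_type = ExpenseType(key_name=name, name=name, total=0)
        expense_type.total += rec.amount
        expense_type.put()

    db.run_in_transaction(_tx)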
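
To see why wrapping the whole read-modify-write makes several shards safe, here is a rough simulation in plain Python. It assumes a hypothetical `FakeStore` whose `run_in_transaction` serialises callers with a lock; the real datastore uses optimistic concurrency with retries rather than a lock, so this only models the outcome, not the mechanism.

```python
import threading

class FakeStore:
    """Hypothetical in-memory stand-in for the datastore."""

    def __init__(self):
        self._totals = {}
        self._lock = threading.Lock()

    def run_in_transaction(self, fn):
        # Simplification: a lock makes each transaction body run alone.
        with self._lock:
            return fn(self._totals)

def add_amount(store, name, amount):
    def _tx(totals):
        # The read and the write happen inside the same transaction.
        totals[name] = totals.get(name, 0) + amount
    store.run_in_transaction(_tx)

def run_shard(store, records):
    for name, amount in records:
        add_amount(store, name, amount)

store = FakeStore()
# Four shards, each with 1000 records of amount 1 for the same type.
shards = [[("travel", 1)] * 1000 for _ in range(4)]
threads = [threading.Thread(target=run_shard, args=(store, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_total = store.run_in_transaction(lambda totals: totals["travel"])
```

Every increment survives because no shard can read a total that another shard is in the middle of updating. The cost is contention: all shards touching the same expense type queue up on the same entity.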

凝望流年 2024-11-03 03:57:33

Using the MapReduce framework is a good idea. You could use more than one shard if you utilize the counters provided by the MapReduce framework. So instead of modifying the datastore each time, you could do something like this:

yield op.counters.Increment("total_<expense_type_name>", rec.amount)

After the MapReduce finishes (hopefully much more quickly than when you were using just one shard), then you can copy the finalized counters into your datastore entity.
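
The shape of this approach (accumulate per shard, merge once at the end) can be sketched in plain Python. This does not use the MapReduce library's counters; `map_shard` and the sample shard inputs are hypothetical stand-ins, with `collections.Counter` playing the role of the framework's counter map.

```python
from collections import Counter

def map_shard(records):
    """Accumulate counters locally for one shard, with no shared state."""
    counters = Counter()
    for expense_type, amount in records:
        counters["total_" + expense_type] += amount
    return counters

# Hypothetical input, already split into two shards.
shard_inputs = [
    [("travel", 120), ("food", 30)],
    [("travel", 80), ("food", 15), ("rent", 900)],
]

# Merge step, run once after all shards have finished.
merged = Counter()
for shard in shard_inputs:
    merged.update(map_shard(shard))

# Strip the counter prefix to get the values to store per expense type.
prefix = "total_"
final_totals = {name[len(prefix):]: value for name, value in merged.items()}
```

Because each shard only touches its own counters, the shards never contend on a summary entity; the only datastore writes are the final ones, one per expense type.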
