google appengine mapper - mapping over a date range
I would like to use the appengine mapper to iterate over a range of dates (from-date and to-date passed as properties to the configuration). For each date in the range, I would retrieve the entities that have this date as a property and operate on this set.
For example, if I have the following set of entities:
Key Date Value
a 2011/09/09 323
b 2011/09/09 132
c 2011/09/08 354
d 2011/09/08 432
e 2011/09/08 234
f 2011/09/07 423
g 2011/09/07 543
I would like to specify a date range of 2011/09/09 - 2011/09/07 which would create three mapper instances, for 2011/09/09, 2011/09/08 and 2011/09/07. In turn these would query for entities a+b, c+d+e and f+g respectively, and perform some operations on the values. (Each of the mappers would also make other datastore queries for additional data, hence the 'bonus question' below)
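The one-shard-per-day expansion described above can be sketched in plain Java (the class and method names here are illustrative, not part of the App Engine mapper API; each returned date would correspond to one mapper instance):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class DateRangeShards {
    // Expand an inclusive from/to date range (from >= to, newest first,
    // matching the question's 2011/09/09 - 2011/09/07 example) into one
    // entry per day; each entry would become one mapper shard, which
    // would then query for the entities carrying that date.
    public static List<LocalDate> shardDates(LocalDate from, LocalDate to) {
        List<LocalDate> dates = new ArrayList<>();
        for (LocalDate d = from; !d.isBefore(to); d = d.minusDays(1)) {
            dates.add(d);
        }
        return dates;
    }
}
```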
Presumably I need to create a custom InputFormat class; however, I'm quite new to mapreduce/hadoop and I was hoping someone had some examples?
Bonus question: is it "bad form" to use a dao to load data in a mapper? Other distributed computing platforms I have worked with (e.g. DataSynapse) would require that you parcel all inputs up and provide them with the task, to prevent too much contention on a dataserver. However, with the appengine HR datastore I presume this isn't a concern?
It's not currently possible to iterate over a subset of entities of a given kind in App Engine's mapreduce implementation. If the entities make up a large proportion of the data, you can simply iterate over everything and ignore the unwanted entities; if they only make up a small proportion, you will have to roll your own update procedure using the task queue.
Based on Nick Johnson's answer, you will need to retrieve your date range from the context using custom parameters. The mapper then filters out (ignores) any entity that falls outside the range before processing it.
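That filtering step can be sketched as a plain predicate the map() method would consult before doing any work (the from/to fields are assumptions; in the real mapper they would be parsed from the job's custom parameters via the context):

```java
import java.time.LocalDate;

public class RangeFilter {
    private final LocalDate from; // newest date in range, e.g. 2011/09/09
    private final LocalDate to;   // oldest date in range, e.g. 2011/09/07

    public RangeFilter(LocalDate from, LocalDate to) {
        this.from = from;
        this.to = to;
    }

    // Returns true if the entity's date falls inside the configured
    // inclusive range; the mapper would simply return without doing
    // anything for entities where this is false.
    public boolean inRange(LocalDate entityDate) {
        return !entityDate.isAfter(from) && !entityDate.isBefore(to);
    }
}
```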
But if you insist on mapping across all entities of a given kind, there is a workaround that may or may not be feasible depending on your requirements. Suppose your date ranges are fairly fixed (sounds unlikely, but just maybe). Then for each expected range you create a corresponding child entity kind, with a parent key pointing to the main entity (or just a reference, but a parent key works better for consistency - think transactions across an entity group).
Thus each entity in the range receives a child entity of the kind corresponding to that range. Then set up a mapper on the child entity kind corresponding to the range, and retrieve each child's parent to work on it.
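The range-to-kind mapping this workaround relies on might look like the following sketch (the kind-name scheme is entirely an assumption; any stable naming that the datastore can query by kind would do):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class RangeKinds {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMdd");

    // Derive a child entity kind name for a fixed date range. Each main
    // entity whose date falls in the range would get a child entity of
    // this kind, keyed with the main entity as parent; the mapper is
    // then configured to run over this kind alone.
    public static String kindFor(LocalDate from, LocalDate to) {
        return "ByDate_" + FMT.format(to) + "_" + FMT.format(from);
    }
}
```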
I do something similar, but in the opposite direction and with a single child entity kind, when populating data for the Relation Index Entity pattern. Hence the answer to your bonus question: go ahead and use a dao, or whatever your data layer consists of.
While the first approach is more sound, the latter may be feasible in cases where your ranges are not very dynamic and stay manageable. Given the schema-less nature of the datastore, creating new entity kinds is neither expensive nor bad practice.